ITHICA: Intra-Thread Instruction Checking Approach for Defect-Induced Silent Data Corruptions

arxiv: 2605.15638 · v1 · pith:ILNMSBVX · submitted 2026-05-15 · 💻 cs.AR · cs.SE

Ioanna Vavelidou, Subho S. Banerjee, Eric X. Liu, Mike Fuller, Subhasish Mitra, Caroline Trippel

Pith reviewed 2026-05-19 19:49 UTC · model grok-4.3

classification 💻 cs.AR cs.SE

keywords silent data corruption · defect detection · functional testing · CPU reliability · instruction duplication · hyperscale servers · intra-thread checking · silent errors

The pith

Intra-thread instruction duplication detects 39% more defective servers by catching inconsistent defect errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ITHICA as a way to automatically convert any program into a functional test for silicon defects that cause silent data corruptions. It works by adding checks that duplicate the same instruction inside one thread and compare the results. The method rests on the observation that the worst defects make an instruction produce different outputs for identical inputs depending on the surrounding execution context. This lets existing industrial programs, datacenter workloads, and libraries serve as stronger tests. Evaluation across thousands of servers shows the new checks find 39% more defective machines than native checks in the same programs and surface new patterns of defect behavior.

Core claim

ITHICA transforms arbitrary programs into tests for defect-induced silent data corruptions by inserting intra-thread, instruction-level error checks that primarily use instruction duplication and output comparison. The central insight is that the most pernicious defects cause inconsistent errors: two executions of the same instruction within the same thread, given the same inputs, can produce different architectural outputs depending on the execution context in which they run. This enables identification of affected instructions upon error detection. When applied to industrial hyperscaler test programs, datacenter workloads, and common libraries and run on over 3,000 CPU servers, the ITHICA-

What carries the argument

Intra-thread instruction duplication and output comparison that exploits inconsistent errors to turn programs into tests and flag affected instructions.
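The duplicate-and-compare idea can be illustrated with a minimal Python sketch. This is not the paper's LLVM IR implementation: `checked`, `SdcDetected`, and `FlakyAdder` are hypothetical names, and the simulated defect (an adder that flips one result bit on every third call) is an assumption chosen only to make a context-dependent mismatch observable in software.

```python
import operator


class SdcDetected(Exception):
    """Raised when an operation and its duplicate disagree."""


def checked(op, *inputs):
    # Execute the same operation twice on identical inputs, then compare.
    first = op(*inputs)
    second = op(*inputs)
    if first != second:
        name = getattr(op, "__name__", repr(op))
        raise SdcDetected(f"{name}{inputs}: {first} != {second}")
    return first


class FlakyAdder:
    """Simulated defect (assumption): bit 0 of the sum flips on every
    third call, standing in for context-dependent hardware state."""

    def __init__(self):
        self.calls = 0

    def add(self, a, b):
        self.calls += 1
        result = a + b
        return result ^ 1 if self.calls % 3 == 0 else result


adder = FlakyAdder()
print(checked(operator.add, 2, 3))  # healthy operation: both runs agree
print(checked(adder.add, 2, 3))     # calls 1 and 2 agree
try:
    checked(adder.add, 2, 3)        # call 3 is corrupted, call 4 is not
except SdcDetected as err:
    print("detected:", err)
```

Note that a defect corrupting both executions identically would pass this comparison, which is why the approach depends on errors being inconsistent.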

If this is right

ITHICA checks derived from baseline industrial programs detect 39% more defective servers than native checks within the same tests.
Datacenter workloads and common libraries can be turned into functional tests for defect-induced errors.
Affected instructions are identified when an error is detected during test execution.
New observations about defect behavior emerge that differ from conclusions in prior hyperscaler fleet studies.
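Because each check is tied to one instruction, a mismatch points at the instruction that failed. The sketch below illustrates that bookkeeping under stated assumptions: `failing_instructions` and `make_defective_mul` are hypothetical helpers, the labels are illustrative opcode names, and the defect model (every fourth multiply flips a bit) is invented for demonstration.

```python
import operator
from collections import Counter


def make_defective_mul():
    """Simulated multiplier (assumption): every 4th call flips bit 2."""
    calls = 0

    def mul(a, b):
        nonlocal calls
        calls += 1
        product = a * b
        return product ^ 4 if calls % 4 == 0 else product

    return mul


def failing_instructions(checks):
    """checks: iterable of (label, op, inputs).
    Returns how many duplicate-and-compare checks failed per label."""
    failing = Counter()
    for label, op, inputs in checks:
        if op(*inputs) != op(*inputs):
            failing[label] += 1
    return failing


mul = make_defective_mul()
program = [("add", operator.add, (1, 2)), ("mul", mul, (3, 5))] * 3
print(failing_instructions(program))
```

Only the `mul` label accumulates failures here, which is the kind of per-instruction signal the list above refers to; as the paper's Figure 5 caption cautions, a failing instruction does not by itself identify a defective hardware unit.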

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same duplication idea could be tried on GPUs or other accelerators where execution context might also trigger inconsistent faults.
Automated pipelines that insert these checks could become routine for screening new hardware batches before deployment.
Defect models used in reliability analysis may need to treat error behavior as context-dependent rather than fixed.

Load-bearing premise

The most pernicious defects cause inconsistent errors such that two executions of the same instruction within the same thread, given the same inputs, can produce different architectural outputs depending on the execution context.

What would settle it

Running duplicated instructions from an ITHICA test on a server already known to produce silent data corruptions and checking whether the two executions with identical inputs always yield identical outputs; consistent outputs would undermine the inconsistent-error premise.
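The shape of that experiment can be sketched in Python against a simulated core. Everything here is an assumption for illustration: `SimulatedCore`, `count_inconsistent`, and the defect model (bit 7 of an XOR result flips when the previously seen operand was odd) are hypothetical and say nothing about real hardware.

```python
class SimulatedCore:
    """Toy stand-in for a CPU core; not a model of any real defect."""

    def __init__(self, defective):
        self.defective = defective
        self.prev_operand = 0  # assumed proxy for "execution context"

    def xor(self, a, b):
        out = a ^ b
        # Assumed defect: bit 7 flips when the previous operand was odd.
        if self.defective and self.prev_operand & 1:
            out ^= 0x80
        self.prev_operand = b
        return out


def count_inconsistent(core, pairs, contexts):
    """Count duplicated executions whose two outputs differ."""
    mismatches = 0
    for a, b in pairs:
        for ctx in contexts:
            core.xor(0, ctx)         # filler instruction sets the context
            first = core.xor(a, b)
            second = core.xor(a, b)  # same inputs, possibly different context
            mismatches += first != second
    return mismatches


print(count_inconsistent(SimulatedCore(False), [(5, 2)], [0, 1]))
print(count_inconsistent(SimulatedCore(True), [(5, 2)], [0, 1]))
```

A count of zero on a machine already known to corrupt data would be the outcome that undermines the premise. In this toy model, input pairs whose second operand is odd can corrupt both executions identically, so the choice of inputs and contexts affects what such an experiment can observe.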

Figures

Figures reproduced from arXiv: 2605.15638 by Caroline Trippel, Eric X. Liu, Ioanna Vavelidou, Mike Fuller, Subhasish Mitra, Subho S. Banerjee.

**Figure 1.** Classification of how hardware errors can manifest as three types of architectural errors [6, 45] (§3.1). ITHICA explicitly detects pernicious inconsistent errors and implicitly detects unresponsive errors.

**Figure 2.** Given an input program (<name>.cpp), ITHICA applies one or more transformations, implemented in this paper for LLVM IR and configured with some BlockSize and Interleaving, and outputs a functional test (<name>-ITHICA).
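To make the notion of a transformation concrete, here is a toy Python sketch of a pass that rewrites a list of three-address instructions into original, duplicate, and check. It emits a made-up pseudo-IR, not valid LLVM IR. The paper's BlockSize and Interleaving parameters are not defined in this summary, so the `block_size` argument below is a hypothetical stand-in that only controls how many originals are emitted before their duplicates; it should not be read as the paper's definition.

```python
def duplicate_and_check(instrs, block_size=1):
    """instrs: list of (dest, opcode, operands) tuples.
    Returns pseudo-IR lines with a duplicate and a check per instruction."""
    out = []
    for start in range(0, len(instrs), block_size):
        block = instrs[start:start + block_size]
        # Originals first, then duplicates reading the same operands.
        out.extend(f"{d} = {op} {', '.join(args)}" for d, op, args in block)
        out.extend(f"{d}.dup = {op} {', '.join(args)}" for d, op, args in block)
        # Finally compare each original against its duplicate.
        out.extend(f"check {d}, {d}.dup" for d, _, _ in block)
    return out


program = [("%a", "add", ("%x", "%y")), ("%b", "mul", ("%a", "%z"))]
for line in duplicate_and_check(program, block_size=2):
    print(line)
```

Each duplicate reads the same operands as its original, so the two executions see identical inputs and differ only in where they sit in the instruction stream.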

**Figure 3.** EDR for different CC-ITHICA tests (with block size and interleaving of 1) and CC in DPool (D1–D14). Each triplet reports results for ITHICA (Ith), Native (Nat), and program crashes with no detection (Cr). For Ith and Nat, the subset of executions that crashed after a detection is shown in parentheses. D6* is uniquely detected by Arith for interleaving of 8 (§7.3).

**Figure 4.** Server detections across all CC-ITHICA runs in both pools.

**Figure 5.** Distribution of failing instructions and their average error frequencies across all CC-ITHICA tests run in the DPool with a block size of 1 and an interleaving of 1, except D6, which is uniquely detected at interleaving of 8 (§7.3). A “failing instruction” denotes one that exhibits an incorrect output; it does not imply a particular defective hardware unit, as discussed in §7.5. Cases where no sp…

**Figure 6.** Impact of interleaving and block size on EDR for each DPool server (D1 to D14), for CC-Arith. The top row shows the effect of varying interleaving (m=max, length of the basic block), while the bottom row shows the effect of varying block size (d=dep, length of instruction dependency chain). The rightmost panels show the average EDR across all servers.

**Figure 7.** Normalized execution frequency of failing opcodes for servers (columns) uniquely detected by one ITHICA program. Orange indicates the detecting program; gray indicates non-detecting programs executing the same opcode.

**Figure 9.** Failing instruction type combinations across ITHICA tests. Bar height: number of servers with errors in that combination. Bar segments: average per-server breakdown of errors by instruction type. Pie chart: same breakdown aggregated across all servers.

**Figure 10.** Comparison of SiliFuzz, CC and ITHICA tests, across all commonly tested servers. Unique server detections for each test are shown in parentheses.

Original abstract

Hyperscaler reports of silent data corruptions (SDCs), presumed to be caused by silicon manufacturing defects, have motivated the development of functional tests for detecting defective CPUs. We present ITHICA, an approach for automatically generating functional tests for defect-induced errors from arbitrary programs by inserting intra-thread, instruction-level error checks, primarily leveraging instruction duplication and output comparison. Our key insight is that the most pernicious defects cause inconsistent errors: two executions of the same instruction within the same thread, given the same inputs, can produce different architectural outputs depending on the execution context in which they run. By exploiting this insight, ITHICA enables arbitrary programs to serve as tests and identifies affected instructions upon error detections. We use ITHICA to transform industrial hyperscaler test programs (our baseline), datacenter workloads, and common libraries into functional tests, and evaluate them on over 3,000 CPU servers. ITHICA error checks detect 39% more defective servers than native checks within the ITHICA tests derived from our baseline programs, and enable novel findings on defect behavior that challenge conclusions drawn by prior hyperscaler fleet studies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ITHICA turns existing programs into SDC tests via intra-thread duplication and reports 39% more detections on real servers, but the link to manufacturing defects lacks independent confirmation.

The main point is that ITHICA generates functional tests for silent data corruptions by duplicating instructions inside the same thread and comparing the outputs to catch inconsistencies. The authors applied this to baseline hyperscaler programs, datacenter workloads, and libraries, ran the resulting tests on more than 3,000 servers, and found 39% more defective machines than the original checks alone picked up. That is the concrete result worth noting first.
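Read as a relative increase in detected-server counts, the headline figure can be written as follows. This is an interpretation of the summary rather than a formula quoted from the paper; $N_{\text{ITHICA}}$ and $N_{\text{native}}$ are hypothetical symbols for the number of defective servers flagged by ITHICA checks and by native checks within the same tests.

```latex
\[
\frac{N_{\text{ITHICA}} - N_{\text{native}}}{N_{\text{native}}} \approx 0.39
\quad\Longleftrightarrow\quad
N_{\text{ITHICA}} \approx 1.39\, N_{\text{native}}
\]
```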

Referee Report

2 major / 2 minor

Summary. The paper introduces ITHICA, an approach to automatically convert arbitrary programs into functional tests for defect-induced silent data corruptions (SDCs) by inserting intra-thread instruction-level checks, primarily via duplication of instructions and comparison of their architectural outputs. The core insight is that the most pernicious manufacturing defects produce inconsistent errors, such that the same instruction executed twice within the same thread on identical inputs can yield different outputs depending on execution context. The method is applied to industrial baseline test programs, datacenter workloads, and common libraries; these transformed tests are run on over 3,000 CPU servers. The evaluation reports that ITHICA checks detect 39% more defective servers than the native checks already present in the baseline-derived tests and yields new observations on defect behavior that challenge prior hyperscaler fleet studies.

Significance. If the attribution of observed inconsistencies to manufacturing defects is substantiated, the work would offer a practical, low-overhead way to leverage existing production programs for defect screening at hyperscale, potentially improving SDC mitigation and prompting re-examination of earlier fleet-study conclusions. The scale of the real-hardware deployment (thousands of servers) constitutes a concrete strength and supports reproducibility of the detection-rate measurements.

major comments (2)

[Abstract] Abstract and evaluation description: the 39% improvement in defective-server detections is presented as a central quantitative result, yet the manuscript provides no independent ground truth (physical failure analysis, controlled fault injection, or orthogonal detection method) to confirm that the additional inconsistencies are caused by permanent manufacturing defects rather than transient faults, environmental variation, or microarchitectural nondeterminism. This attribution is load-bearing for both the percentage claim and the challenge to prior studies.
[Evaluation] Methods and evaluation sections: the assumption that defects produce context-dependent inconsistent outputs for identical instructions and inputs is used to justify turning arbitrary programs into tests via duplication/comparison, but no validation experiments or controls are described that would rule out other sources of intra-thread output variation. Without such evidence the extra detections cannot be unambiguously credited to defects.

minor comments (2)

[Abstract] Abstract: the phrase 'over 3,000 CPU servers' should be replaced by the exact count and a brief statement of selection criteria.
[Throughout] Notation: ensure consistent use of 'SDC' after its first definition and clarify whether 'native checks' refers to existing hardware mechanisms or to the baseline program's own assertions.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive comments and the opportunity to clarify aspects of our work. We address each major comment below and have revised the manuscript where feasible to strengthen the attribution of results to manufacturing defects.

Referee: [Abstract] Abstract and evaluation description: the 39% improvement in defective-server detections is presented as a central quantitative result, yet the manuscript provides no independent ground truth (physical failure analysis, controlled fault injection, or orthogonal detection method) to confirm that the additional inconsistencies are caused by permanent manufacturing defects rather than transient faults, environmental variation, or microarchitectural nondeterminism. This attribution is load-bearing for both the percentage claim and the challenge to prior studies.

Authors: We acknowledge that direct ground truth such as physical failure analysis would provide stronger confirmation. However, such analysis is impractical at the scale of over 3,000 production servers due to cost, time, and the need to maintain fleet availability. We instead rely on the repeatability of inconsistencies across repeated executions on the same servers and the context-dependent error pattern, which aligns with known defect behaviors cited in the paper. Transient faults and environmental factors are mitigated by our experimental design of multiple runs per test. We will add a dedicated paragraph in the evaluation section discussing alternative explanations and why the observed patterns are most consistent with permanent defects.
Revision: partial

Referee: [Evaluation] Methods and evaluation sections: the assumption that defects produce context-dependent inconsistent outputs for identical instructions and inputs is used to justify turning arbitrary programs into tests via duplication/comparison, but no validation experiments or controls are described that would rule out other sources of intra-thread output variation. Without such evidence the extra detections cannot be unambiguously credited to defects.

Authors: The assumption draws from established CPU defect literature on intermittent and context-sensitive errors, which we reference. To strengthen this, we will include new control experiments in the revised evaluation: running the duplicated instruction sequences on a set of known-good servers to quantify baseline variation from microarchitectural sources, and reporting that detected inconsistencies are persistent rather than sporadic. This supports crediting the additional detections to defects while acknowledging that complete isolation of all nondeterministic sources remains challenging without hardware-level instrumentation.
Revision: yes
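The proposed control could be summarized per server along these lines. This is a minimal sketch, assuming each server contributes a list of booleans (one per repeated run, true when a check fired); `classify_servers`, the 0.5 threshold, and the server names are all hypothetical and do not come from the paper or the rebuttal.

```python
def classify_servers(runs_by_server, persistent_fraction=0.5):
    """Label each server by how often repeated runs detect an inconsistency.
    The threshold is an illustrative assumption, not a published criterion."""
    labels = {}
    for server, runs in runs_by_server.items():
        hits = sum(runs)
        if hits == 0:
            labels[server] = "clean"
        elif hits / len(runs) >= persistent_fraction:
            labels[server] = "persistent"
        else:
            labels[server] = "sporadic"
    return labels


# Made-up example data: one known-good server and two suspects.
example = {
    "good-1": [False, False, False, False],
    "suspect-1": [True, True, False, True],
    "suspect-2": [False, False, False, True],
}
print(classify_servers(example))
```

Detections on known-good servers, or a large share of sporadic labels, would point toward nondeterminism unrelated to defects, which is the alternative explanation the referee raises.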

Standing simulated objections (not resolved)

Independent physical failure analysis or controlled fault injection at hyperscale to provide definitive ground truth for all detected servers

Circularity Check

0 steps flagged

No significant circularity; empirical hardware results independent of self-referential inputs

The paper presents ITHICA as a method to generate tests from arbitrary programs by exploiting an assumed key insight on defect-induced inconsistent errors. The central quantitative claim (39% more detections) is obtained by executing the generated tests on over 3,000 real CPU servers and comparing against native checks within the same tests. No equations, fitted parameters, or derived predictions are described that reduce the reported detection improvement to a quantity defined by the paper's own inputs or prior self-citations. The evaluation uses external hardware benchmarks, satisfying the criterion for a self-contained result against external measurements. A minor score of 2 accounts for the normal presence of an unverified modeling assumption without any reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that pernicious defects produce inconsistent outputs on repeated identical instructions within one thread; no free parameters or invented entities are described.

axioms (1)

Domain assumption: the most pernicious defects cause inconsistent errors, where the same instruction with the same inputs produces different outputs depending on execution context.
This is explicitly stated as the key insight enabling the test generation approach.

pith-pipeline@v0.9.0 · 5748 in / 1188 out tokens · 40744 ms · 2026-05-19T19:49:05.506826+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Our key insight... the most pernicious defects cause inconsistent errors: two executions of the same instruction within the same thread, given the same inputs, can produce different architectural outputs depending on the execution context

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: ITHICA error checks detect 39% more defective servers than native checks

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · 1 internal anchor

[1]

Andreas Abel, Yuying Li, Richard O’Grady, Chris Kennelly, and Darryl Gove. 2024. A Profiling-Based Benchmark Suite for Warehouse-Scale Computers. In2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 325–327. https://doi.org/10.1109/ISPASS61541.2024.00046

work page doi:10.1109/ispass61541.2024.00046 2024
[2]

Andreas Abel and Jan Reineke. 2019. uops.info: Characterizing Latency, Through- put, and Port Usage of Instructions on Intel Microarchitectures. InASPLOS (Providence, RI, USA)(ASPLOS ’19). ACM, New York, NY, USA, 673–686. https: //doi.org/10.1145/3297858.3304062

work page doi:10.1145/3297858.3304062 2019
[3]

abseil 2024. Abseil. https://github.com/abseil/abseil-cpp

work page 2024
[4]

Paul, Ming Zhang, and Subhasish Mitra

Mridul Agarwal, Bipul C. Paul, Ming Zhang, and Subhasish Mitra. 2007. Circuit Failure Prediction and Its Application to Transistor Aging. In25th IEEE VLSI Test Symposium (VTS’07). 277–286. https://doi.org/10.1109/VTS.2007.22

work page doi:10.1109/vts.2007.22 2007
[5]

Chang, Chao-Wen Tseng, Chien-Mo James Li, Mike Purtell, and Edward Joseph McCluskey

Jonathan T.-Y. Chang, Chao-Wen Tseng, Chien-Mo James Li, Mike Purtell, and Edward Joseph McCluskey. 1998. Analysis of pattern-dependent and timing-dependent failures in an experimental test chip.Proceedings Interna- tional Test Conference 1998 (IEEE Cat. No.98CH36270)(1998), 184–193. https: //api.semanticscholar.org/CorpusID:16286356

work page 1998
[6]

D’Agostino, Ioanna Vavelidou, Vijay D

Saranyu Chattopadhyay, Keerthikumara Devarajegowda, Bihan Zhao, Florian Lonsing, Brandon A. D’Agostino, Ioanna Vavelidou, Vijay D. Bhatt, Sebastian Prebeck, Wolfgang Ecker, Caroline Trippel, Clark Barrett, and Subhasish Mitra

work page
[7]

In2023 60th ACM/IEEE Design Automation Conference (DAC)

G-QED: Generalized QED Pre-silicon Verification beyond Non-Interfering Hardware Accelerators. In2023 60th ACM/IEEE Design Automation Conference (DAC). 1–6. https://doi.org/10.1109/DAC56929.2023.10247903

work page doi:10.1109/dac56929.2023.10247903 2023
[8]

Kulkarni

Odysseas Chatzopoulos, Nikos Karystinos, George Papadimitriou, Dimitris Gi- zopoulos, Harish D. Dixit, and Sriram Sankar. 2025. Veritas - Demystifying Silent Data Corruptions: uArch-Level Modeling and Fleet Data of Modern x86 CPUs. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). 1–14. https://doi.org/10.1109/HPCA6190...

work page doi:10.1109/hpca61900.2025.00012 2025
[9]

Odysseas Chatzopoulos, George Papadimitriou, Dimitris Gizopoulos, Harish D Dixit, and Sriram Sankar. 2025. From gates to sdcs: Understanding fault propa- gation through the compute stack. In2025 Design, Automation & Test in Europe Conference (DATE). IEEE, 1–7

work page 2025
[10]

Tze Wee Chen, Kyunglok Kim, Young Moon Kim, and Subhasish Mitra. 2008. Gate-Oxide Early Life Failure Prediction. In26th IEEE VLSI Test Symposium (vts 2008). 111–118. https://doi.org/10.1109/VTS.2008.55

work page doi:10.1109/vts.2008.55 2008
[11]

Szafaryn, Chen-Yong Cher, Hyungmin Cho, Kevin Skadron, Mircea R

Eric Cheng, Shahrzad Mirkhani, Lukasz G. Szafaryn, Chen-Yong Cher, Hyungmin Cho, Kevin Skadron, Mircea R. Stan, Klas Lilja, Jacob A. Abraham, Pradip Bose, and Subhasish Mitra. 2016. CLEAR: Cross-Layer Exploration for Architecting Resilience - Combining hardware and software techniques to tolerate soft errors in processor cores. InProceedings of the 53rd A...

work page doi:10.1145/2897937.2897996 2016
[12]

Peter Deutsch, Harish Dixit, Gautham Vunnam, Carl Moran, Eleanor Ozer, and Sriram Sankar. 2026. PinDrop: Breaking the Silence on SDCs in a Large-Scale Fleet. 1–14. https://doi.org/10.1109/HPCA68181.2026.11408620

work page doi:10.1109/hpca68181.2026.11408620 2026
[13]

Deutsch, Vincent Quentin Ulitzsch, Sudhanva Gurumurthi, Vilas Srid- haran, Joel S

Peter W. Deutsch, Vincent Quentin Ulitzsch, Sudhanva Gurumurthi, Vilas Srid- haran, Joel S. Emer, and Mengjia Yan. 2024. DelayAVF: Calculating Architectural Vulnerability Factors for Delay Faults. In2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO). 231–245. https://doi.org/10.1109/ MICRO61859.2024.00026

work page arXiv 2024
[14]

Moslem Didehban and Aviral Shrivastava. 2016. nZDC: A compiler technique for near Zero Silent Data Corruption. In2016 53nd ACM/EDAC/IEEE Design Automation Conference (DAC). 1–6. https://doi.org/10.1145/2897937.2898054

work page doi:10.1145/2897937.2898054 2016
[15]

Harish Dattatraya Dixit, Laura Boyle, Gautham Vunnam, Sneha Pendharkar, Matt Beadon, and Sriram Sankar. 2022. Detecting silent data corruptions in the wild. arXiv:2203.08989 [cs.AR] https://arxiv.org/abs/2203.08989

work page arXiv 2022
[16]

Harish Dattatraya Dixit, Sneha Pendharkar, Matt Beadon, Chris Mason, Tejasvi Chakravarthy, Bharath Muthiah, and Sriram Sankar. 2021. Silent Data Corrup- tions at Scale.CoRRabs/2102.11245 (2021). https://arxiv.org/abs/2102.11245

work page arXiv 2021
[17]

2025.Hardware Sentinel: Protecting Software Applications from Hardware Silent Data Corruptions

Rhea Dutta, Harish Dattatraya Dixit, Rik Van Riel, Gautham Vunnam, and Sriram Sankar. 2025.Hardware Sentinel: Protecting Software Applications from Hardware Silent Data Corruptions. Association for Computing Machinery, New York, NY, USA, 482–497. https://doi.org/10.1145/3676641.3716258

work page doi:10.1145/3676641.3716258 2025
[18]

E. B. Eichelberger and T. W. Williams. 1988. A logic design structure for LSI testability. InPapers on Twenty-Five Years of Electronic Design Automation (25 years of DAC). Association for Computing Machinery, New York, NY, USA, 358–364. https://doi.org/10.1145/62882.62924

work page doi:10.1145/62882.62924 1988
[19]

Shuguang Feng, Shantanu Gupta, Amin Ansari, and Scott Mahlke. 2010. Shoestring: probabilistic soft error reliability on the cheap.ACM SIGPLAN Notices45 (03 2010), 385. https://doi.org/10.1145/1735971.1736063

work page doi:10.1145/1735971.1736063 2010
[20]

Nikos Foutris, Dimitris Gizopoulos, Mihalis Psarakis, Xavier Vera, and An- tonio Gonzalez. 2011. Accelerating microprocessor silicon validation by ex- posing ISA diversity. InProceedings of the 44th Annual IEEE/ACM Interna- tional Symposium on Microarchitecture(Porto Alegre, Brazil)(MICRO-44). As- sociation for Computing Machinery, New York, NY, USA, 386–...

work page doi:10.1145/2155620.2155666 2011
[21]

Chappell

Nishant George, Sudhanva Gurumurthi, Vilas Sridharan, Harish Dattatraya Dixit, Emel Goksu, Bharath Parthasarathy, Amber Huffman, Thiago Macieira, Arani Sinha, Dean Liberty, Lisa Minwell, and Robert S. Chappell. 2025. Silent Data Corruption in AI: A Growing Challenge for Large-Scale Machine Learning.IEEE Micro(2025), 1–7. https://doi.org/10.1109/MM.2025.3645670

work page doi:10.1109/mm.2025.3645670 2025
[22]

Dixit, and Sriram Sankar

Dimitris Gizopoulos, George Papadimitriou, Odysseas Chatzopoulos, Nikos Karystinos, Harish D. Dixit, and Sriram Sankar. 2024. Silent Data Corruptions in Computing Systems: Early Predictions and Large-Scale Measurements. In2024 IEEE European Test Symposium (ETS). 1–10. https://doi.org/10.1109/ETS61313. 2024.10567770

work page doi:10.1109/ets61313 2024
[23]

Google. 2020. Google cpu-check torture test. https://github.com/google/cpu- check

work page 2020
[24]

Google. 2021. Silifuzz. https://github.com/google/silifuzz

work page 2021
[25]

Google. 2022. Fleetbench. https://github.com/google/fleetbench

work page 2022
[26]

Hapke, R

F. Hapke, R. Krenz-Baath, A. Glowatz, J. Schloeffel, H. Hashempour, S. Eichen- berger, C. Hora, and D. Adolfsson. 2009. Defect-oriented cell-aware ATPG and fault simulation for industrial cell libraries and designs. In2009 International Test Conference

work page 2009
[27]

Zhengyang He, Yafan Huang, Hui Xu, Dingwen Tao, and Guanpeng Li. 2023. Demystifying and Mitigating Cross-Layer Deficiencies of Soft Error Protection in Instruction Duplication. InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis(Denver, CO, USA) (SC ’23). Association for Computing Machinery, New Y...

work page doi:10.1145/3581784.3607078 2023
[28]

Zhengyang He, Hui Xu, and Guanpeng Li. 2024. A Fast Low-Level Error Detection Technique. In2024 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). 90–98. https://doi.org/10.1109/DSN58291.2024. 00023

work page doi:10.1109/dsn58291.2024 2024
[29]

Heragu, J.H

K. Heragu, J.H. Patel, and V.D. Agrawal. 1996. Segment delay faults: a new fault model. InProceedings of 14th VLSI Test Symposium

work page 1996
[30]

Hochschild, Paul Turner, Jeffrey C

Peter H. Hochschild, Paul Turner, Jeffrey C. Mogul, Rama Govindaraju, Parthasarathy Ranganathan, David E. Culler, and Amin Vahdat. 2021. Cores That Don’t Count. InProceedings of the Workshop on Hot Topics in Operating Systems

work page 2021
[31]

Gardner, and Subhasish Mitra

Ted Hong, Yanjing Li, Sung-Boem Park, Diana Mui, David Lin, Ziyad Abdel Kaleq, Nagib Hakim, Helia Naeimi, Donald S. Gardner, and Subhasish Mitra

work page
[32]

In 2010 IEEE International Test Conference

QED: Quick Error Detection tests for effective post-silicon validation. In 2010 IEEE International Test Conference

work page 2010
[33]

Deutsch, Vincent Quentin Ulitzsch, Sudhanva Gurumurthi, Vilas Srid- haran, Joel S

Yao Hsiao, Nikos Nikoleris, Artem Khyzha, Dominic P. Mulligan, Gustavo Petri, Christopher W. Fletcher, and Caroline Trippel. 2024. RTL2M 𝜇PATH: Multi- 𝜇PATH Synthesis with Applications to Hardware Security Verification. In2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO). 507–524. https://doi.org/10.1109/MICRO61859.2024.00045

work page doi:10.1109/micro61859.2024.00045 2024
[34]

Yafan Huang, Shengjian Guo, Sheng Di, Guanpeng Li, and Franck Cappello. 2022. Mitigating Silent Data Corruptions in HPC Applications across Multiple Program Inputs. InSC22: International Conference for High Performance Computing, Net- working, Storage and Analysis. 1–14. https://doi.org/10.1109/SC41404.2022.00022

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1109/sc41404.2022.00022 2022
[35]

Intel. 2021. OpenDCDiag. https://github.com/opendcdiag

work page 2021
[36]

Nikos Karystinos, Odysseas Chatzopoulos, George-Marios Fragkoulis, George Pa- padimitriou, Dimitris Gizopoulos, and Sudhanva Gurumurthi. 2024. Harpocrates: Breaking the Silence of CPU Faults through Hardware-in-the-Loop Program Generation. In2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). 516–531. https://doi.org/10.1109...

work page doi:10.1109/isca59077.2024.00045 2024
[37]

Nikos Karystinos, George-Marios Fragkoulis, Odysseas Chatzopoulos, Dimitris Gizopoulos, and Sudhanva Gurumurthi. 2025. Harpocrates++: Automated Func- tional Program Generation against CPU Faults and Silent Data Corruptions.IEEE Micro(2025), 1–9. https://doi.org/10.1109/MM.2025.3640385

work page doi:10.1109/mm.2025.3640385 2025
[38]

Kundu, S

S. Kundu, S. Sengupta, and R. Galivanche. 2000. Test challenges in nanometer technologies. InProceedings IEEE European Test Workshop

work page 2000
[39]

2021.SiliFuzz: Fuzzing CPUs by proxy

Doug Kwan, Kostik Shtoyk, Kostya Serebryany, Maxim L Lifantsev, and Peter Hochschild. 2021.SiliFuzz: Fuzzing CPUs by proxy. Technical Report. Google

work page 2021
[40]

Li and E.J

J.C.-M. Li and E.J. McCluskey. 2002. Diagnosis of sequence-dependent chips. InProceedings 20th IEEE VLSI Test Symposium (VTS 2002). 187–192. https: //doi.org/10.1109/VTS.2002.1011137

work page doi:10.1109/vts.2002.1011137 2002
[41]

Wei Li, Chris Nigh, Danielle Duvalsaint, Subhasish Mitra, and R. D. Blanton. 2022. PEPR: Pseudo-Exhaustive Physically-Aware Region Testing. InInternational Test Conference

work page 2022
[42]

The LLVM C Library

libcllvm 2024. The LLVM C Library. https://libc.llvm.org/

work page 2024
[43]

LLVM libc++

libcxxllvm 2024. LLVM libc++. https://github.com/llvm/llvm- project/blob/main/libcxx/include/concepts

work page 2024
[44]

David Lin, Ted Hong, Yanjing Li, Farzan Fallah, Donald S Gardner, Nagib Hakim, and Subhasish Mitra. 2013. Overcoming post-silicon validation challenges through quick error detection (QED). In2013 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 320–325

[45]

David Lin, Ted Hong, Yanjing Li, Eswaran S, Sharad Kumar, Farzan Fallah, Nagib Hakim, Donald S. Gardner, and Subhasish Mitra. 2014. Effective Post-Silicon Validation of System-on-Chips Using Quick Error Detection. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 33, 10 (2014), 1573–1590. https://doi.org/10.1109/TCAD.2014.2334301...

[46]

llvm-language-ref 2022. LLVM Language Reference Manual. https://llvm.org/docs/LangRef.html. Accessed: 2022-10-19

[47]

Florian Lonsing, Subhasish Mitra, and Clark W. Barrett. 2020. A Theoretical Framework for Symbolic Quick Error Detection. In 2020 Formal Methods in Computer Aided Design, FMCAD 2020, Haifa, Israel, September 21-24, 2020. IEEE, 1–10. https://doi.org/10.34727/2020/ISBN.978-3-85448-042-6_9

[48]

Jiacheng Ma, Majd Ganaiem, Madeline Burbage, Theo Gregersen, Rachel McAmis, Freddy Gabbay, and Baris Kasikci. 2025. Proactive Runtime Detection of Aging-Related Silent Data Corruptions: A Bottom-Up Approach. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4 (Hilton La ...

doi:10.1145/3622781.3674182
[49]

S.C. Ma, P. Franco, and E.J. McCluskey. 1995. An experimental chip to evaluate test techniques experiment results. In Proceedings of 1995 IEEE International Test Conference (ITC)

[50]

Timothy C. May and Murray H. Woods. 1978. A New Physical Mechanism for Soft Errors in Dynamic Memories. In 16th International Reliability Physics Symposium. 33–40. https://doi.org/10.1109/IRPS.1978.362815

[51]

E.J. McCluskey. 1993. Quality and single-stuck faults. In Proceedings of IEEE International Test Conference - (ITC)

[52]

E.J. McCluskey and Chao-Wen Tseng. 2000. Stuck-fault tests vs. actual defects. In Proceedings International Test Conference 2000 (IEEE Cat. No.00CH37159)

[53]

Yixuan Mei, Shreya Varshini, Harish Dixit, Sriram Sankar, and K. V. Rashmi. 2026. SEVI: Silent Data Corruption of Vector Instructions in Hyper-Scale Datacenters. In Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (USA) (ASPLOS ’26). Association for Computing Machinery, Ne...

doi:10.1145/3779212.3790217
[54]

Justin Meza, Qiang Wu, Sanjeev Kumar, and Onur Mutlu. 2015. Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field. In 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. 415–426. https://doi.org/10.1109/DSN.2015.57

[55]

Subhasish Mitra, Subho Banerjee, Martin Dixon, Rama Govindaraju, Peter Hochschild, Eric X. Liu, Bharath Parthasarathy, and Parthasarathy Ranganathan. Silent Data Corruption by 10x Test Escapes Threatens Reliable Computing. arXiv:2508.01786 [cs.AR] https://arxiv.org/abs/2508.01786

[57]

S. Mukherjee. 2008. Architecture Design for Soft Errors. https://doi.org/10.1016/B978-0-12-369529-1.X5001-0

[58]

S.S. Mukherjee, J. Emer, and S.K. Reinhardt. 2005. The soft error problem: an architectural perspective. In 11th International Symposium on High-Performance Computer Architecture

[59]

S.S. Mukherjee, C. Weaver, J. Emer, S.K. Reinhardt, and T. Austin. 2003. A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor. In Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36. 29–40. https://doi.org/10.1109/MICRO.2003.1253181

[60]

N. Oh, S. Mitra, and E.J. McCluskey. 2002. ED4I: error detection by diverse data and duplicated instructions. IEEE Trans. Comput. 51, 2 (2002), 180–199. https://doi.org/10.1109/12.980007

[61]

N. Oh, P.P. Shirvani, and E.J. McCluskey. 2002. Control-flow checking by software signatures. IEEE Transactions on Reliability 51, 1 (2002), 111–122. https://doi.org/10.1109/24.994926

[62]

Nahmsuk Oh, Philip Shirvani, and Edward McCluskey. 2002. Error detection by duplicated instructions in super-scalar processors. IEEE Transactions on Reliability 51, 1 (2002), 63–75. https://doi.org/10.1109/24.994913

[63]

OpenHW Group. 2019. CVA6 RISC-V CPU. https://github.com/openhwgroup/cva6

[64]

openssl 2024. OpenSSL. https://github.com/openssl/openssl

[65]

openssl-manual [n. d.]. OPENSSL Debian Manpages. https://manpages.debian.org/testing/libssl-doc/OPENSSL_LH_doall_arg.3ssl.en.html

[66]

George Papadimitriou and Dimitris Gizopoulos. 2023. AVGI: Microarchitecture-Driven, Fast and Accurate Vulnerability Assessment. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 935–948. https://doi.org/10.1109/HPCA56546.2023.10071105

[67]

George Papadimitriou and Dimitris Gizopoulos. 2023. Silent Data Corruptions: Microarchitectural Perspectives. IEEE Trans. Comput. 72, 11 (2023), 3072–3085. https://doi.org/10.1109/TC.2023.3285094

[68]

George Papadimitriou, Dimitris Gizopoulos, Harish Dattatraya Dixit, and Sriram Sankar. 2023. Silent Data Corruptions: The Stealthy Saboteurs of Digital Integrity. 2023 IEEE 29th International Symposium on On-Line Testing and Robust System Design (IOLTS) (2023), 1–7. https://api.semanticscholar.org/CorpusID:261315246

[69]

Priyadarsan Patra. 2007. On the cusp of a validation wall. IEEE Design & Test of Computers 24, 2 (2007), 193–196. https://doi.org/10.1109/MDT.2007.54

[70]

Bipul C. Paul, Kunhyuk Kang, Haldun Kufluoglu, Muhammad A. Alam, and Kaushik Roy. 2007. Negative Bias Temperature Instability: Estimation and Design for Improved Reliability of Nanoscale Circuits. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 26, 4 (2007), 743–751. https://doi.org/10.1109/TCAD.2006.884870

[71]

Mahesh Prabhu and Jacob A. Abraham. 2012. Functional test generation for hard to detect stuck-at faults using RTL model checking. In 2012 17th IEEE European Test Symposium (ETS)

[72]

G.A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D.I. August. 2005. SWIFT: software implemented fault tolerance. In International Symposium on Code Generation and Optimization. 243–254. https://doi.org/10.1109/CGO.2005.34

[73]

Matthias Sauer, Young Moon Kim, Jun Seomun, Hyung-Ock Kim, Kyung-Tae Do, Jung Yun Choi, Kee Sup Kim, Subhasish Mitra, and Bernd Becker. 2013. Early-life-failure detection using SAT-based ATPG. In 2013 IEEE International Test Conference (ITC). 1–10. https://doi.org/10.1109/TEST.2013.6651925

[74]

Jian Shen and Jacob A. Abraham. 1998. Native mode functional test generation for processors with applications to self test and design validation. Proceedings International Test Conference 1998 (IEEE Cat. No.98CH36270) (1998), 990–999. https://api.semanticscholar.org/CorpusID:14132281

[75]

Eshan Singh, Clark W. Barrett, and Subhasish Mitra. 2017. E-QED: Electrical Bug Localization During Post-silicon Validation Enabled by Quick Error Detection and Formal Methods. (2017)

[76]

Gordon L. Smith. 1985. Model for Delay Faults Based upon Paths. In International Test Conference

[77]

Wilson Snyder, Paul Wasson, and Duane Galbi et al. [n. d.]. Verilator. https://verilator.org

[78]

Daniel Sorin. 2009. Fault Tolerant Computer Architecture. Vol. 4. https://doi.org/10.2200/S00192ED1V01Y200904CAC005

[79]

T.M. Storey and W. Maly. 1990. CMOS bridging fault detection. In Proceedings. International Test Conference 1990

[80]

E. Takeda and N. Suzuki. 1983. An empirical model for device degradation due to hot-carrier injection. IEEE Electron Device Letters 4, 4 (1983), 111–113. https://doi.org/10.1109/EDL.1983.25667

