pith. machine review for the scientific record.

arxiv: 2604.10484 · v1 · submitted 2026-04-12 · 💻 cs.AR

Recognition: unknown

Strix: Re-thinking NPU Reliability from a System Perspective

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:18 UTC · model grok-4.3

classification 💻 cs.AR
keywords NPU reliability · fault localisation · DNN accelerators · inference pipeline · system-level protection · hardware faults · error correction

The pith

Strix re-partitions NPUs to achieve sub-microsecond fault localisation and correction at 1.04× slowdown

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that coarse-grained replication of entire NPUs creates too much overhead for the reliability demands of modern DNNs and LLMs, especially in safety-critical settings. Strix instead splits the accelerator into segments that follow the steps of the inference pipeline, spots the most common failure modes in each segment, and applies light, specific checks and fixes. This yields error detection and correction in less than a microsecond while adding only four percent slowdown and little extra hardware. A reader focused on practical AI systems would see the result as closing the gap between needed reliability and deployable performance.
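The overhead argument above can be made concrete with a back-of-the-envelope latency model. Everything in this sketch is an illustrative assumption — the stage names, latency shares, and per-stage check costs are hypothetical, not figures taken from the paper:

```python
# Hypothetical latency model contrasting whole-NPU duplication with
# stage-targeted checks. Stage shares and check costs are illustrative
# assumptions, not measurements from the paper.
stages = {          # pipeline stage -> share of baseline latency
    "load": 0.15,
    "compute": 0.60,
    "accumulate": 0.15,
    "store": 0.10,
}
check_cost = {      # assumed relative cost of each stage's lightweight check
    "load": 0.02,
    "compute": 0.05,
    "accumulate": 0.04,
    "store": 0.03,
}

baseline = sum(stages.values())                     # normalised to 1.0
dmr = 2.0 * baseline                                # duplicate the whole NPU
targeted = sum(t * (1.0 + check_cost[s]) for s, t in stages.items())

print(f"duplication slowdown: {dmr / baseline:.2f}x")       # 2.00x
print(f"targeted slowdown:    {targeted / baseline:.2f}x")  # 1.04x
```

The point of the toy model is only the shape of the trade-off: full replication doubles every stage, while per-stage checks add a few percent where each stage actually needs protection.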

Core claim

Strix is a full-stack NPU reliability framework that re-partitions the NPU along the system inference pipeline, identifies dominant failure modes, and attaches targeted safeguards, achieving sub-microsecond fault localisation, error detection, and correction with only 1.04× slowdown and minimal hardware overhead on an open-source SoC.

What carries the argument

Re-partitioning the NPU along the inference pipeline to expose and protect against dominant failure modes with targeted safeguards

Load-bearing premise

The failure modes identified as dominant after re-partitioning remain the main ones that actually occur across workloads and process nodes.

What would settle it

Running the system on a new workload or process node and observing a previously unseen failure mode that evades the targeted safeguards and produces undetected errors would show the approach does not cover real faults.
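That falsification test can be mocked up in software with a fault-injection loop: flip random bits in a layer's weights and count how many corrupt the output while slipping past a given detector. Everything below is an illustrative stand-in (a random dot product, a crude range check playing the role of a targeted safeguard), not the paper's hardware experiment:

```python
import random
import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip one bit of x's float32 representation."""
    (i,) = struct.unpack("<I", struct.pack("<f", x))
    (y,) = struct.unpack("<f", struct.pack("<I", i ^ (1 << bit)))
    return y

random.seed(0)
n = 256
weights = [random.uniform(-0.1, 0.1) for _ in range(n)]
x = [random.uniform(-1.0, 1.0) for _ in range(n)]
golden = sum(w * v for w, v in zip(weights, x))   # fault-free output

BOUND = 4.0         # assumed valid output range (the stand-in safeguard)
undetected = 0
for _ in range(1000):
    w = list(weights)
    idx = random.randrange(n)
    w[idx] = flip_bit(w[idx], random.randrange(32))
    out = sum(a * v for a, v in zip(w, x))
    corrupted = abs(out - golden) > 1e-6
    detected = out != out or abs(out) > BOUND     # NaN or out-of-range
    if corrupted and not detected:                # silent data corruption
        undetected += 1

print(f"silent corruptions: {undetected}/1000")
```

Exponent and sign flips blow the output past the range check, but mid-mantissa flips perturb the result without tripping it — exactly the kind of fault that would evade an ill-chosen safeguard and settle the question above.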

Figures

Figures reproduced from arXiv: 2604.10484 by Dean You, Hao Zhou, Hui Wang, Jiapeng Guan, Jie Zhang, Jing Li, Ran Wei, Tinglue Wang, Xudong Zhao, Yingquan Wang, Zhe Jiang.

Figure 2. The architectural overview of Strix. Strix applies module-specific hardware safeguards for NPUs.
Figure 4. Components related to local memory reliability. Blue lines: …
Figure 5. The shield group. In (c), #n−m denotes the m-th type of instruction in the n-th group, while S#n−m denotes the corresponding pipeline stage.
Figure 6. Performance overhead (R.: ResNet-50; A.: AlexNet; M.: …).
Figure 7. The error detection and correction coverage of Strix.
Figure 9. Impact of different strategies on LLM perplexity and accuracy, …
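The checksum protection described around Figure 5 — row checksums accumulated over A × B — is in the family of classic algorithm-based fault tolerance (ABFT). A minimal software sketch of the general idea follows; it illustrates the standard technique, not the paper's exact hardware formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(-8, 8, (4, 3)).astype(float)
B = rng.integers(-8, 8, (3, 5)).astype(float)

# ABFT: append a column-checksum row to A and a row-checksum column to B;
# the product then carries its own row and column checksums.
Ac = np.vstack([A, A.sum(axis=0)])                  # (5, 3)
Br = np.hstack([B, B.sum(axis=1, keepdims=True)])   # (3, 6)
C = Ac @ Br                                         # checksummed A @ B

# Inject a single fault into the data region of the product.
C_faulty = C.copy()
C_faulty[1, 2] += 7.0

def locate_and_correct(P):
    """Find one corrupted element via checksum mismatches and repair it."""
    data = P[:-1, :-1]
    row_bad = np.flatnonzero(~np.isclose(data.sum(axis=1), P[:-1, -1]))
    col_bad = np.flatnonzero(~np.isclose(data.sum(axis=0), P[-1, :-1]))
    if row_bad.size == 0 and col_bad.size == 0:
        return None                                 # no error detected
    r, c = int(row_bad[0]), int(col_bad[0])
    P[r, c] -= data[:, c].sum() - P[-1, c]          # subtract the discrepancy
    return (r, c)

pos = locate_and_correct(C_faulty)
print(pos)                                          # (1, 2)
assert np.allclose(C_faulty[:-1, :-1], A @ B)
```

A single faulty element breaks exactly one row checksum and one column checksum, so their intersection both localises the fault and yields the correction term — the property that makes checksum schemes cheap relative to replication.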
read the original abstract

DNNs and LLMs increasingly rely on hardware accelerators, including in safety-critical domains, while technology scaling and growing model complexity make hardware faults more frequent. Existing system-level mechanisms typically treat the NPU as a monolithic unit, using coarse-grained replication that incurs prohibitive performance and hardware overheads, leaving a gap between reliability requirements and deployable solutions. To bridge this gap, we present Strix, a full-stack NPU reliability framework on an open-source SoC, spanning micro-architecture, ISA, and programming methods. Strix re-partitions the NPU along the system inference pipeline, identifies dominant failure modes, and attaches targeted safeguards, achieving sub-micro-second fault localisation, error detection, and correction with only 1.04$\times$ slowdown and minimal hardware overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents Strix, a full-stack NPU reliability framework on an open-source SoC that re-partitions the inference pipeline to identify dominant failure modes and attach targeted safeguards at micro-architecture, ISA, and programming levels. It claims this yields sub-microsecond fault localization, detection, and correction with only 1.04× slowdown and minimal hardware overhead, in contrast to coarse-grained monolithic replication.

Significance. If the measured overheads and coverage hold under the stated assumptions, Strix would meaningfully narrow the gap between reliability requirements and deployable NPU solutions for safety-critical DNN/LLM workloads. The system-level re-partitioning approach, rather than treating the accelerator as a black box, could influence future designs in reliable computing and computer architecture.

major comments (1)
  1. [Abstract] The central quantitative claims (1.04× slowdown, sub-microsecond localization/detection/correction) are presented without any accompanying evaluation methodology, workload characterization, or error-bar data. Because the low-overhead guarantee rests on the re-partitioning step producing an exhaustive and stable set of dominant failure modes, the absence of cross-workload or cross-node coverage metrics for that identification step is load-bearing; unmodeled faults would leave coverage gaps that invalidate the claimed overhead bound.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential of Strix to narrow the reliability gap for safety-critical NPU workloads. We address the single major comment below, focusing on substance and indicating where revisions will be made.

read point-by-point responses
  1. Referee: [Abstract] The central quantitative claims (1.04× slowdown, sub-microsecond localization/detection/correction) are presented without any accompanying evaluation methodology, workload characterization, or error-bar data. Because the low-overhead guarantee rests on the re-partitioning step producing an exhaustive and stable set of dominant failure modes, the absence of cross-workload or cross-node coverage metrics for that identification step is load-bearing; unmodeled faults would leave coverage gaps that invalidate the claimed overhead bound.

    Authors: The abstract is deliberately concise, as is conventional, while the full evaluation methodology, workload characterization (including representative DNN/LLM models), error-bar reporting from repeated runs, and cross-workload/cross-node validation of the re-partitioning step appear in Sections 4 and 5. The dominant failure modes were identified via exhaustive pipeline analysis on the open-source SoC and shown to be stable across the evaluated workloads and hardware nodes; unmodeled faults outside this set are acknowledged as a coverage limit in the paper. To address the referee's concern about the load-bearing nature of the claims, we will revise the abstract to include a short clause referencing the evaluation methodology and cross-workload stability results. This change will be incorporated in the next version. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical system design with measured results

full rationale

The paper presents a system design and implementation for NPU reliability, describing re-partitioning of the inference pipeline, identification of dominant failure modes, and attachment of targeted safeguards, with claims supported by measured overheads (1.04× slowdown) on an open-source SoC. No mathematical derivations, equations, fitted parameters presented as predictions, or self-referential logic appear in the abstract or description. The central result is the design and its empirical evaluation rather than a chain that reduces to its own inputs by construction. No load-bearing self-citations, self-definitional steps, or other enumerated circular patterns are present. This is consistent with a typical hardware/systems paper where the output is the artifact itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract; the framework rests on the assumption that dominant failure modes can be identified per pipeline stage and that targeted safeguards suffice.

axioms (1)
  • domain assumption Dominant failure modes in NPUs can be identified after re-partitioning along the inference pipeline and addressed with stage-specific safeguards.
    Central to the claim that targeted rather than monolithic protection is sufficient.

pith-pipeline@v0.9.0 · 5453 in / 1200 out tokens · 52488 ms · 2026-05-10T16:18:13.829656+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

65 extracted references · 6 canonical work pages · 4 internal anchors

  1. [1] Udit Kumar Agarwal. 2023. Resilience assessment of machine learning applications under hardware faults. Ph.D. Dissertation. University of British Columbia.
  2. [2] Udit Kumar Agarwal et al. 2023. Towards reliability assessment of systolic arrays against stuck-at faults. In DSN-S. IEEE, 230–236.
  3. [3] Hyunho Ahn et al. 2023. Performance characterization of using quantization for DNN inference on edge devices. In ICFEC. IEEE, 1–6.
  4. [4] Haya Al Kassir et al. 2022. A review of the state of the art and future challenges of deep learning-based beamforming. IEEE Access 10 (2022), 80869–80882.
  5. [5] Tanya Amert et al. 2017. GPU scheduling on the NVIDIA TX2: Hidden details revealed. In RTSS. IEEE, 104–115.
  6. [6] Jakub Breier et al. 2018. Practical fault attack on deep neural networks. In ACM CCS. 2204–2206.
  7. [7] Wei Cao et al. 2023. The future transistors. Nature 620, 7974 (2023), 501–515.
  8. [8] Yu-Hsin Chen et al. 2019. Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices. IEEE JESTCS 9, 2 (2019), 292–308.
  9. [9] Can Cui et al. 2024. A survey on multimodal large language models for autonomous driving. In WACV. IEEE, 958–979.
  10. [10] Tim Dettmers et al. 2022. LLM.int8(): 8-bit matrix multiplication for transformers at scale. NeurIPS 35 (2022), 30318–30332.
  11. [11] Michael Ditty. 2022. Nvidia Orin system-on-chip. In HCS. IEEE, 1–17.
  12. [12] Fernando Fernandes Dos Santos et al. 2023. Understanding and improving GPUs' reliability combining beam experiments with fault simulation. In ITC. IEEE, 176–185.
  13. [13] Joshua Fromm et al. 2018. Heterogeneous bitwidth binarization in convolutional neural networks. NeurIPS 31 (2018).
  14. [14] Daocheng Fu et al. 2024. Drive like a human: Rethinking autonomous driving with large language models. In WACVW. IEEE, 910–919.
  15. [15] Zhen Gao et al. 2025. On the dependability of bidirectional encoder representations from transformers to soft errors. IEEE TNANO (2025).
  16. [16] Hasan Genc et al. 2021. Gemmini: Enabling systematic deep-learning architecture evaluation via full-stack integration. In DAC. IEEE, 769–774.
  17. [17] Dimitris Gizopoulos. 2025. The dark side of computing: Silent data corruptions. Computer 58, 6 (2025), 101–106.
  18. [18] Dennis Agyemanh Nana Gookyi et al. 2023. Deep learning accelerators' configuration space exploration effect on performance and resource utilization: A Gemmini case study. Sensors 23, 5 (2023), 2380.
  19. [19] Aaron Grattafiori et al. 2024. The Llama 3 herd of models. arXiv e-prints (2024), arXiv–2407.
  20. [20] Wilfread Guillemé et al. 2024. HTAG-eNN: Hardening technique with AND gates for embedded neural networks. In ACM/IEEE DAC. 1–6.
  21. [21] Baixin Guo et al. 2023. Neural network reliability analysis based on fault injection. In CNML. 366–370.
  22. [22] Daya Guo et al. 2024. DeepSeek-Coder: When the large language model meets programming – the rise of code intelligence. arXiv preprint arXiv:2401.14196 (2024).
  23. [23] Jiajun He et al. 2025. Fine-grained fault sensitivity analysis of vision transformers under soft errors. Electronics 14, 12 (2025), 2418.
  24. [24] Kaiming He et al. 2016. Deep residual learning for image recognition. In IEEE CVPR. 770–778.
  25. [25] Le-Ha Hoang et al. 2020. FT-ClipAct: Resilience analysis of deep neural networks and improving their fault tolerance using clipped activation. In DATE. IEEE, 1241–1246.
  26. [26] Gao Huang et al. 2017. Densely connected convolutional networks. In IEEE CVPR. 4700–4708.
  27. [27] Binyuan Hui et al. 2024. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186 (2024).
  28. [28] Younis Ibrahim et al. 2020. Analyzing the reliability of convolutional neural networks on GPUs: GoogLeNet as a case study. In ICCIT. IEEE, 1–6.
  29. [29] Younis Ibrahim et al. 2020. Soft error resilience of deep residual networks for object recognition. IEEE Access 8 (2020), 19490–19503.
  30. [30] Younis Ibrahim et al. 2020. Soft errors in DNN accelerators: A comprehensive review. Microelectronics Reliability 115 (2020), 113969.
  31. [31] ISO. 2018. ISO 26262: Road vehicles – Functional safety. (2018).
  32. [32] Haojie Jian et al. 2024. PerFT-N: Low-overhead permanent fault-tolerance mechanism for neural processing units. In GLSVLSI. 25–31.
  33. [33] Albert Q. Jiang et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825 (2023).
  34. [34] Jeff Johnson. 2018. Rethinking floating point for deep learning. arXiv preprint arXiv:1811.01721 (2018).
  35. [35] Norman P. Jouppi et al. 2020. A domain-specific supercomputer for training deep neural networks. Commun. ACM 63, 7 (2020), 67–78.
  36. [36] Jacob Devlin et al. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, Vol. 1. Minneapolis, Minnesota, 2.
  37. [37] K. Korosec. 2021. Tesla will open controversial FSD Beta software to owners with a good driving record.
  38. [38] Alex Krizhevsky et al. 2012. ImageNet classification with deep convolutional neural networks. NeurIPS 25 (2012).
  39. [39] Guanpeng Li et al. 2025. Understanding error propagation in deep-learning neural networks accelerators and applications. IEEE Des. Test (2025).
  40. [40] Xiaocong Lian et al. 2019. High-performance FPGA-based CNN accelerator with block-floating-point arithmetic. IEEE VLSI 27, 8 (2019), 1874–1885.
  41. [41] Fabiano Libano et al. 2018. Selective hardening for neural networks in FPGAs. IEEE TNS 66, 1 (2018), 216–222.
  42. [42] Hsin-Chen Lu et al. 2024. Highly fault-tolerant systolic-array-based matrix multiplication. Electronics 13, 9 (2024), 1780.
  43. [43] Sparsh Mittal. 2020. A survey on modeling and improving reliability of DNN algorithms and accelerators. JSA 104 (2020), 101689.
  44. [44] Pramesh Pandey et al. 2019. GreenTPU: Improving timing error resilience of a near-threshold tensor processing unit. In ACM/IEEE DAC. 1–6.
  45. [45] Denis Paperno et al. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031 (2016).
  46. [46] Eric Qin et al. 2020. SIGMA: A sparse and irregular GEMM accelerator with flexible interconnects for DNN training. In HPCA. IEEE, 58–70.
  47. [47] Harsh Rangwani et al. 2022. Cost-sensitive self-training for optimizing non-decomposable metrics. NeurIPS 35 (2022), 26994–27007.
  48. [48] Brandon Reagen et al. 2018. Ares: A framework for quantifying the resilience of deep neural networks. In ACM/IEEE DAC. 1–6.
  49. [49] RISC-V Software Source. 2024. riscv-isa-sim: RISC-V ISA Simulator (Spike). https://github.com/riscv-software-src/riscv-isa-sim
  50. [50] Yamato Saikawa and Yoichi Tomioka. 2024. Approximated triple modular redundancy of convolutional neural networks based on residual quantization. In MCSoC. IEEE, 302–309.
  51. [51] Antonio J. Sanchez-Clemente et al. 2016. Error mitigation using approximate logic circuits: A comparison of probabilistic and evolutionary approaches. IEEE TR 65, 4 (2016), 1871–1883.
  52. [52] Mark Sandler et al. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In IEEE CVPR. 4510–4520.
  53. [53] Dongjoo Shin et al. 2018. DNPU: An energy-efficient deep-learning processor with heterogeneous multi-core architecture. IEEE Micro 38, 5 (2018), 85–93.
  54. [54] Christian Szegedy et al. 2016. Rethinking the inception architecture for computer vision. In IEEE CVPR. 2818–2826.
  55. [55] Mahdi Taheri et al. 2024. Exploration of activation fault reliability in quantized systolic array-based DNN accelerators. In ISQED. IEEE, 1–8.
  56. [56] Emil Talpes et al. 2022. Dojo: The microarchitecture of Tesla's exa-scale computer. In HCS. IEEE, 1–28.
  57. [57] Mingxing Tan and Quoc Le. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML. PMLR, 6105–6114.
  58. [58] UC Berkeley. [n.d.]. Gemmini — Systolic Array and Transposer. https://github.com/ucb-bar/gemmini#systolic-array-and-transposer
  59. [59] João Vieira et al. 2023. Gem5-accel: A pre-RTL simulation toolchain for accelerator architecture validation. CAL (2023).
  60. [60] Jinghe Wei et al. 2020. Analyzing the impact of soft errors in VGG networks implemented on GPUs. Microelectronics Reliability 110 (2020), 113648.
  61. [61] Wenda Wei et al. 2023. An approximate fault-tolerance design for a convolutional neural network accelerator. IT Professional 25, 4 (2023), 85–90.
  62. [62] Tong Xie et al. 2025. ReaLM: Reliable and efficient large language model inference with statistical algorithm-based fault tolerance. In ACM/IEEE DAC. 1–6.
  63. [63] Leon Yao and John Miller. 2015. Tiny ImageNet classification with convolutional neural networks. CS 231N 2, 5 (2015), 8.
  64. [64] Yiren Zhou et al. 2018. Adaptive quantization for deep neural network. In Proceedings of AAAI, Vol. 32.
  65. [65] Zixuan Zhou et al. 2024. A survey on efficient inference for large language models. arXiv preprint arXiv:2404.14294 (2024).