pith. machine review for the scientific record.

arxiv: 2604.10484 · v1 · submitted 2026-04-12 · 💻 cs.AR

Recognition: unknown

Strix: Re-thinking NPU Reliability from a System Perspective

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:18 UTC · model grok-4.3

classification 💻 cs.AR
keywords NPU reliability · fault localisation · DNN accelerators · inference pipeline · system-level protection · hardware faults · error correction

The pith

Strix re-partitions NPUs to achieve sub-microsecond fault localisation and correction at 1.04× slowdown

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that coarse-grained replication of entire NPUs creates too much overhead for the reliability demands of modern DNNs and LLMs, especially in safety-critical settings. Strix instead splits the accelerator into segments that follow the steps of the inference pipeline, spots the most common failure modes in each segment, and applies light, specific checks and fixes. This yields error detection and correction in less than a microsecond while adding only four percent slowdown and little extra hardware. A reader focused on practical AI systems would see the result as closing the gap between needed reliability and deployable performance.
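The overhead argument above can be made concrete with a back-of-the-envelope latency model. Everything in this sketch is an illustrative assumption — the stage names, latency shares, and per-stage check costs are hypothetical, not figures taken from the paper:

```python
# Hypothetical latency model contrasting whole-NPU duplication with
# stage-targeted checks. Stage shares and check costs are illustrative
# assumptions, not measurements from the paper.
stages = {          # pipeline stage -> share of baseline latency
    "load": 0.15,
    "compute": 0.60,
    "accumulate": 0.15,
    "store": 0.10,
}
check_cost = {      # assumed relative cost of each stage's lightweight check
    "load": 0.02,
    "compute": 0.05,
    "accumulate": 0.04,
    "store": 0.03,
}

baseline = sum(stages.values())                     # normalised to 1.0
dmr = 2.0 * baseline                                # duplicate the whole NPU
targeted = sum(t * (1.0 + check_cost[s]) for s, t in stages.items())

print(f"duplication slowdown: {dmr / baseline:.2f}x")       # 2.00x
print(f"targeted slowdown:    {targeted / baseline:.2f}x")  # 1.04x
```

The point of the toy model is only the shape of the trade-off: full replication doubles every stage, while per-stage checks add a few percent where each stage actually needs protection.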

Core claim

Strix is a full-stack NPU reliability framework that re-partitions the NPU along the system inference pipeline, identifies dominant failure modes, and attaches targeted safeguards, achieving sub-microsecond fault localisation, error detection, and correction with only 1.04× slowdown and minimal hardware overhead on an open-source SoC.

What carries the argument

Re-partitioning the NPU along the inference pipeline to expose and protect against dominant failure modes with targeted safeguards

Load-bearing premise

The failure modes identified as dominant after re-partitioning remain the main ones that actually occur across workloads and process nodes.

What would settle it

Running the system on a new workload or process node and observing a previously unseen failure mode that evades the targeted safeguards and produces undetected errors would show the approach does not cover real faults.
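That falsification test can be mocked up in software with a fault-injection loop: flip random bits in a layer's weights and count how many corrupt the output while slipping past a given detector. Everything below is an illustrative stand-in (a random dot product, a crude range check playing the role of a targeted safeguard), not the paper's hardware experiment:

```python
import random
import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip one bit of x's float32 representation."""
    (i,) = struct.unpack("<I", struct.pack("<f", x))
    (y,) = struct.unpack("<f", struct.pack("<I", i ^ (1 << bit)))
    return y

random.seed(0)
n = 256
weights = [random.uniform(-0.1, 0.1) for _ in range(n)]
x = [random.uniform(-1.0, 1.0) for _ in range(n)]
golden = sum(w * v for w, v in zip(weights, x))   # fault-free output

BOUND = 4.0         # assumed valid output range (the stand-in safeguard)
undetected = 0
for _ in range(1000):
    w = list(weights)
    idx = random.randrange(n)
    w[idx] = flip_bit(w[idx], random.randrange(32))
    out = sum(a * v for a, v in zip(w, x))
    corrupted = abs(out - golden) > 1e-6
    detected = out != out or abs(out) > BOUND     # NaN or out-of-range
    if corrupted and not detected:                # silent data corruption
        undetected += 1

print(f"silent corruptions: {undetected}/1000")
```

Exponent and sign flips blow the output past the range check, but mid-mantissa flips perturb the result without tripping it — exactly the kind of fault that would evade an ill-chosen safeguard and settle the question above.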

Figures

Figures reproduced from arXiv: 2604.10484 by Dean You, Hao Zhou, Hui Wang, Jiapeng Guan, Jie Zhang, Jing Li, Ran Wei, Tinglue Wang, Xudong Zhao, Yingquan Wang, Zhe Jiang.

Figure 2. The architectural overview of Strix. Strix applies module-specific hardware safeguards for NPUs.
Figure 4. Components related to local memory reliability. Blue lines: …
Figure 5. The shield group. In (c), #n−m denotes the m-th type of instruction in the n-th group, while S#n−m denotes the corresponding pipeline stage.
Figure 6. Performance overhead (R.: ResNet-50; A.: AlexNet; M.: …).
Figure 7. The error detection and correction coverage of Strix.
Figure 9. Impact of different strategies on LLM perplexity and accuracy, …
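The checksum protection described around Figure 5 — row checksums accumulated over A × B — is in the family of classic algorithm-based fault tolerance (ABFT). A minimal software sketch of the general idea follows; it illustrates the standard technique, not the paper's exact hardware formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(-8, 8, (4, 3)).astype(float)
B = rng.integers(-8, 8, (3, 5)).astype(float)

# ABFT: append a column-checksum row to A and a row-checksum column to B;
# the product then carries its own row and column checksums.
Ac = np.vstack([A, A.sum(axis=0)])                  # (5, 3)
Br = np.hstack([B, B.sum(axis=1, keepdims=True)])   # (3, 6)
C = Ac @ Br                                         # checksummed A @ B

# Inject a single fault into the data region of the product.
C_faulty = C.copy()
C_faulty[1, 2] += 7.0

def locate_and_correct(P):
    """Find one corrupted element via checksum mismatches and repair it."""
    data = P[:-1, :-1]
    row_bad = np.flatnonzero(~np.isclose(data.sum(axis=1), P[:-1, -1]))
    col_bad = np.flatnonzero(~np.isclose(data.sum(axis=0), P[-1, :-1]))
    if row_bad.size == 0 and col_bad.size == 0:
        return None                                 # no error detected
    r, c = int(row_bad[0]), int(col_bad[0])
    P[r, c] -= data[:, c].sum() - P[-1, c]          # subtract the discrepancy
    return (r, c)

pos = locate_and_correct(C_faulty)
print(pos)                                          # (1, 2)
assert np.allclose(C_faulty[:-1, :-1], A @ B)
```

A single faulty element breaks exactly one row checksum and one column checksum, so their intersection both localises the fault and yields the correction term — the property that makes checksum schemes cheap relative to replication.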
read the original abstract

DNNs and LLMs increasingly rely on hardware accelerators, including in safety-critical domains, while technology scaling and growing model complexity make hardware faults more frequent. Existing system-level mechanisms typically treat the NPU as a monolithic unit, using coarse-grained replication that incurs prohibitive performance and hardware overheads, leaving a gap between reliability requirements and deployable solutions. To bridge this gap, we present Strix, a full-stack NPU reliability framework on an open-source SoC, spanning micro-architecture, ISA, and programming methods. Strix re-partitions the NPU along the system inference pipeline, identifies dominant failure modes, and attaches targeted safeguards, achieving sub-micro-second fault localisation, error detection, and correction with only 1.04$\times$ slowdown and minimal hardware overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents Strix, a full-stack NPU reliability framework on an open-source SoC that re-partitions the inference pipeline to identify dominant failure modes and attach targeted safeguards at micro-architecture, ISA, and programming levels. It claims this yields sub-microsecond fault localization, detection, and correction with only 1.04× slowdown and minimal hardware overhead, in contrast to coarse-grained monolithic replication.

Significance. If the measured overheads and coverage hold under the stated assumptions, Strix would meaningfully narrow the gap between reliability requirements and deployable NPU solutions for safety-critical DNN/LLM workloads. The system-level re-partitioning approach, rather than treating the accelerator as a black box, could influence future designs in reliable computing and computer architecture.

major comments (1)
  1. [Abstract] The central quantitative claims (1.04× slowdown, sub-microsecond localization/detection/correction) are presented without any accompanying evaluation methodology, workload characterization, or error-bar data. Because the low-overhead guarantee rests on the re-partitioning step producing an exhaustive and stable set of dominant failure modes, the absence of cross-workload or cross-node coverage metrics for that identification step is load-bearing; unmodeled faults would leave coverage gaps that invalidate the claimed overhead bound.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential of Strix to narrow the reliability gap for safety-critical NPU workloads. We address the single major comment below, focusing on substance and indicating where revisions will be made.

read point-by-point responses
  1. Referee: [Abstract] The central quantitative claims (1.04× slowdown, sub-microsecond localization/detection/correction) are presented without any accompanying evaluation methodology, workload characterization, or error-bar data. Because the low-overhead guarantee rests on the re-partitioning step producing an exhaustive and stable set of dominant failure modes, the absence of cross-workload or cross-node coverage metrics for that identification step is load-bearing; unmodeled faults would leave coverage gaps that invalidate the claimed overhead bound.

    Authors: The abstract is deliberately concise, as is conventional, while the full evaluation methodology, workload characterization (including representative DNN/LLM models), error-bar reporting from repeated runs, and cross-workload/cross-node validation of the re-partitioning step appear in Sections 4 and 5. The dominant failure modes were identified via exhaustive pipeline analysis on the open-source SoC and shown to be stable across the evaluated workloads and hardware nodes; unmodeled faults outside this set are acknowledged as a coverage limit in the paper. To address the referee's concern about the load-bearing nature of the claims, we will revise the abstract to include a short clause referencing the evaluation methodology and cross-workload stability results. This change will be incorporated in the next version. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical system design with measured results

full rationale

The paper presents a system design and implementation for NPU reliability, describing re-partitioning of the inference pipeline, identification of dominant failure modes, and attachment of targeted safeguards, with claims supported by measured overheads (1.04× slowdown) on an open-source SoC. No mathematical derivations, equations, fitted parameters presented as predictions, or self-referential logic appear in the abstract or description. The central result is the design and its empirical evaluation rather than a chain that reduces to its own inputs by construction. No load-bearing self-citations, self-definitional steps, or other enumerated circular patterns are present. This is consistent with a typical hardware/systems paper where the output is the artifact itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract; the framework rests on the assumption that dominant failure modes can be identified per pipeline stage and that targeted safeguards suffice.

axioms (1)
  • domain assumption Dominant failure modes in NPUs can be identified after re-partitioning along the inference pipeline and addressed with stage-specific safeguards.
    Central to the claim that targeted rather than monolithic protection is sufficient.

pith-pipeline@v0.9.0 · 5453 in / 1200 out tokens · 52488 ms · 2026-05-10T16:18:13.829656+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

65 extracted references · 6 canonical work pages · 4 internal anchors

  1. [1] Udit Kumar Agarwal. 2023. Resilience assessment of machine learning applications under hardware faults. Ph.D. Dissertation. University of British Columbia.
  2. [2] Udit Kumar Agarwal et al. 2023. Towards reliability assessment of systolic arrays against stuck-at faults. In DSN-S. IEEE, 230–236.
  3. [3] Hyunho Ahn et al. 2023. Performance characterization of using quantization for DNN inference on edge devices. In ICFEC. IEEE, 1–6.
  4. [4] Haya Al Kassir et al. 2022. A review of the state of the art and future challenges of deep learning-based beamforming. IEEE Access 10 (2022), 80869–80882.
  5. [5] Tanya Amert et al. 2017. GPU scheduling on the NVIDIA TX2: Hidden details revealed. In RTSS. IEEE, 104–115.
  6. [6] Jakub Breier et al. 2018. Practical fault attack on deep neural networks. In ACM CCS. 2204–2206.
  7. [7] Wei Cao et al. 2023. The future transistors. Nature 620, 7974 (2023), 501–515.
  8. [8] Yu-Hsin Chen et al. 2019. Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices. IEEE JESTCS 9, 2 (2019), 292–308.
  9. [9] Can Cui et al. 2024. A survey on multimodal large language models for autonomous driving. In WACV. IEEE, 958–979.
  10. [10] Tim Dettmers et al. 2022. LLM.int8(): 8-bit matrix multiplication for transformers at scale. NeurIPS 35 (2022), 30318–30332.
  11. [11] Michael Ditty. 2022. Nvidia Orin system-on-chip. In HCS. IEEE, 1–17.
  12. [12] Fernando Fernandes Dos Santos et al. 2023. Understanding and improving GPUs' reliability combining beam experiments with fault simulation. In ITC. IEEE, 176–185.
  13. [13] Joshua Fromm et al. 2018. Heterogeneous bitwidth binarization in convolutional neural networks. NeurIPS 31 (2018).
  14. [14] Daocheng Fu et al. 2024. Drive like a human: Rethinking autonomous driving with large language models. In WACVW. IEEE, 910–919.
  15. [15] Zhen Gao et al. 2025. On the dependability of bidirectional encoder representations from transformers to soft errors. IEEE TNANO (2025).
  16. [16] Hasan Genc et al. 2021. Gemmini: Enabling systematic deep-learning architecture evaluation via full-stack integration. In DAC. IEEE, 769–774.
  17. [17] Dimitris Gizopoulos. 2025. The dark side of computing: Silent data corruptions. Computer 58, 6 (2025), 101–106.
  18. [18] Dennis Agyemanh Nana Gookyi et al. 2023. Deep learning accelerators' configuration space exploration effect on performance and resource utilization: A Gemmini case study. Sensors 23, 5 (2023), 2380.
  19. [19] Aaron Grattafiori et al. 2024. The Llama 3 herd of models. arXiv e-prints (2024), arXiv–2407.
  20. [20] Wilfread Guillemé et al. 2024. HTAG-eNN: Hardening technique with AND gates for embedded neural networks. In ACM/IEEE DAC. 1–6.
  21. [21] Baixin Guo et al. 2023. Neural network reliability analysis based on fault injection. In CNML. 366–370.
  22. [22] Daya Guo et al. 2024. DeepSeek-Coder: When the large language model meets programming – the rise of code intelligence. arXiv preprint arXiv:2401.14196 (2024).
  23. [23] Jiajun He et al. 2025. Fine-grained fault sensitivity analysis of vision transformers under soft errors. Electronics 14, 12 (2025), 2418.
  24. [24] Kaiming He et al. 2016. Deep residual learning for image recognition. In IEEE CVPR. 770–778.
  25. [25] Le-Ha Hoang et al. 2020. FT-ClipAct: Resilience analysis of deep neural networks and improving their fault tolerance using clipped activation. In DATE. IEEE, 1241–1246.
  26. [26] Gao Huang et al. 2017. Densely connected convolutional networks. In IEEE CVPR. 4700–4708.
  27. [27] Binyuan Hui et al. 2024. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186 (2024).
  28. [28] Younis Ibrahim et al. 2020. Analyzing the reliability of convolutional neural networks on GPUs: GoogLeNet as a case study. In ICCIT. IEEE, 1–6.
  29. [29] Younis Ibrahim et al. 2020. Soft error resilience of deep residual networks for object recognition. IEEE Access 8 (2020), 19490–19503.
  30. [30] Younis Ibrahim et al. 2020. Soft errors in DNN accelerators: A comprehensive review. Microelectronics Reliability 115 (2020), 113969.
  31. [31] ISO. 2018. ISO 26262: Road vehicles – Functional safety. (2018).
  32. [32] Haojie Jian et al. 2024. PerFT-N: Low-overhead permanent fault-tolerance mechanism for neural processing units. In GLSVLSI. 25–31.
  33. [33] Albert Q. Jiang et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825 (2023).
  34. [34] Jeff Johnson. 2018. Rethinking floating point for deep learning. arXiv preprint arXiv:1811.01721 (2018).
  35. [35] Norman P. Jouppi et al. 2020. A domain-specific supercomputer for training deep neural networks. Commun. ACM 63, 7 (2020), 67–78.
  36. [36] Jacob Devlin et al. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, Vol. 1. Minneapolis, Minnesota, 2.
  37. [37] K. Korosec. 2021. Tesla will open controversial FSD Beta software to owners with a good driving record.
  38. [38] Alex Krizhevsky et al. 2012. ImageNet classification with deep convolutional neural networks. NeurIPS 25 (2012).
  39. [39] Guanpeng Li et al. 2025. Understanding error propagation in deep-learning neural networks accelerators and applications. IEEE Des. Test (2025).
  40. [40] Xiaocong Lian et al. 2019. High-performance FPGA-based CNN accelerator with block-floating-point arithmetic. IEEE VLSI 27, 8 (2019), 1874–1885.
  41. [41] Fabiano Libano et al. 2018. Selective hardening for neural networks in FPGAs. IEEE TNS 66, 1 (2018), 216–222.
  42. [42] Hsin-Chen Lu et al. 2024. Highly fault-tolerant systolic-array-based matrix multiplication. Electronics 13, 9 (2024), 1780.
  43. [43] Sparsh Mittal. 2020. A survey on modeling and improving reliability of DNN algorithms and accelerators. JSA 104 (2020), 101689.
  44. [44] Pramesh Pandey et al. 2019. GreenTPU: Improving timing error resilience of a near-threshold tensor processing unit. In ACM/IEEE DAC. 1–6.
  45. [45] Denis Paperno et al. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031 (2016).
  46. [46] Eric Qin et al. 2020. SIGMA: A sparse and irregular GEMM accelerator with flexible interconnects for DNN training. In HPCA. IEEE, 58–70.
  47. [47] Harsh Rangwani et al. 2022. Cost-sensitive self-training for optimizing non-decomposable metrics. NeurIPS 35 (2022), 26994–27007.
  48. [48] Brandon Reagen et al. 2018. Ares: A framework for quantifying the resilience of deep neural networks. In ACM/IEEE DAC. 1–6.
  49. [49] RISC-V Software Source. 2024. riscv-isa-sim: RISC-V ISA Simulator (Spike). https://github.com/riscv-software-src/riscv-isa-sim
  50. [50] Yamato Saikawa and Yoichi Tomioka. 2024. Approximated triple modular redundancy of convolutional neural networks based on residual quantization. In MCSoC. IEEE, 302–309.
  51. [51] Antonio J. Sanchez-Clemente et al. 2016. Error mitigation using approximate logic circuits: A comparison of probabilistic and evolutionary approaches. IEEE TR 65, 4 (2016), 1871–1883.
  52. [52] Mark Sandler et al. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In IEEE CVPR. 4510–4520.
  53. [53] Dongjoo Shin et al. 2018. DNPU: An energy-efficient deep-learning processor with heterogeneous multi-core architecture. IEEE Micro 38, 5 (2018), 85–93.
  54. [54] Christian Szegedy et al. 2016. Rethinking the inception architecture for computer vision. In IEEE CVPR. 2818–2826.
  55. [55] Mahdi Taheri et al. 2024. Exploration of activation fault reliability in quantized systolic array-based DNN accelerators. In ISQED. IEEE, 1–8.
  56. [56] Emil Talpes et al. 2022. Dojo: The microarchitecture of Tesla's exa-scale computer. In HCS. IEEE, 1–28.
  57. [57] Mingxing Tan and Quoc Le. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML. PMLR, 6105–6114.
  58. [58] UC Berkeley. [n.d.]. Gemmini — Systolic Array and Transposer. https://github.com/ucb-bar/gemmini#systolic-array-and-transposer
  59. [59] João Vieira et al. 2023. Gem5-accel: A pre-RTL simulation toolchain for accelerator architecture validation. CAL (2023).
  60. [60] Jinghe Wei et al. 2020. Analyzing the impact of soft errors in VGG networks implemented on GPUs. Microelectronics Reliability 110 (2020), 113648.
  61. [61] Wenda Wei et al. 2023. An approximate fault-tolerance design for a convolutional neural network accelerator. IT Professional 25, 4 (2023), 85–90.
  62. [62] Tong Xie et al. 2025. ReaLM: Reliable and efficient large language model inference with statistical algorithm-based fault tolerance. In ACM/IEEE DAC. 1–6.
  63. [63] Leon Yao and John Miller. 2015. Tiny ImageNet classification with convolutional neural networks. CS 231N 2, 5 (2015), 8.
  64. [64] Yiren Zhou et al. 2018. Adaptive quantization for deep neural network. In Proceedings of AAAI, Vol. 32.
  65. [65] Zixuan Zhou et al. 2024. A survey on efficient inference for large language models. arXiv preprint arXiv:2404.14294 (2024).