arXiv: 2604.23831 · v1 · submitted 2026-04-26 · cs.AR · cs.SY · eess.SY

Architectural Isolation as a Timing Safety Primitive for Edge AI Medical Devices: Controlled Experimental Evidence on a Shared-Silicon Platform

Pith reviewed 2026-05-08 04:59 UTC · model grok-4.3

classification: cs.AR · cs.SY · eess.SY
keywords: architectural isolation · timing safety · edge AI medical devices · latency verification · output stability · shared silicon platform · inference layer validation

The pith

Accuracy and output stability can hold while timing constraints fail on shared hardware for edge AI medical devices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that passing accuracy validation and keeping outputs stable does not ensure that a device meets its timing requirements under real load. It runs the same model on two execution paths of one shared-silicon chip: a GPU accelerator path and a CPU path. Both paths produce stable outputs with no safety-threshold exceedances, yet only the GPU path keeps latency low enough for a 10 Hz clinical cycle; the CPU path breaches that budget. This demonstrates that timing behavior is a separate property from accuracy and stability. The authors propose checking stability and latency together as a way to strengthen safety validation for edge medical devices.

Core claim

A system can satisfy accuracy-based validation, maintain output stability with Safety-Threshold Exceedance Rate equal to zero, and still violate timing constraints under deployment load. These are structurally independent properties. The demonstration uses identical MobileNetV2 models under identical adversarial load on two paths of the same NVIDIA Jetson Orin Nano: the TensorRT FP16 GPU path keeps mean latency below 11 ms while the ONNX Runtime FP32 CPU path shows 9.8 times higher mean latency and breaches the 10 Hz budget by 65 percent, even though both paths maintain STER equal to zero.
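The timing half of this claim reduces to a per-cycle budget check. The following is a minimal sketch, assuming that the 10 Hz cycle implies a 100 ms budget per inference and that latencies are logged in milliseconds. The function name `budget_report` and the sample values are hypothetical; they are not the paper's tooling or data.

```python
from statistics import mean


def budget_report(latencies_ms, rate_hz=10.0):
    """Summarize logged latencies against the cycle budget implied by rate_hz."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    budget_ms = 1000.0 / rate_hz  # a 10 Hz cycle leaves 100 ms per inference
    over = [t for t in latencies_ms if t > budget_ms]
    return {
        "budget_ms": budget_ms,
        "mean_ms": mean(latencies_ms),
        "max_ms": max(latencies_ms),
        "fraction_over_budget": len(over) / len(latencies_ms),
    }


# Illustrative values only (assumed, not measurements from the paper).
fast_path = [9.0, 10.0, 10.5, 9.5]
slow_path = [80.0, 120.0, 150.0, 170.0]

fast = budget_report(fast_path)
slow = budget_report(slow_path)
print(fast)
print(slow)
```

Reporting the fraction of samples over budget alongside the mean keeps a path with a low average but a heavy tail from passing unnoticed.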

What carries the argument

Architectural isolation on a shared-silicon platform, implemented by running identical models on a dedicated GPU accelerator versus a general-purpose CPU under the same combined load to separate timing behavior from accuracy and stability outcomes.

Load-bearing premise

That the specific adversarial load, MobileNetV2 model, and Jetson Orin Nano hardware setup are representative of real medical edge device conditions, and that a zero Safety-Threshold Exceedance Rate equates to clinical safety.

What would settle it

Repeating the identical experiment on the same hardware and model and finding that the CPU path keeps latency below 100 ms under combined load, with STER still equal to zero, would show that the reported timing violation does not reproduce and would remove the paper's evidence that timing is independent of the other properties.

Figures

Figures reproduced from arXiv: 2604.23831 by Akul Mallayya Swami.

Figure 1: Jetson Orin Nano Super execution paths. GPU path (TensorRT FP16) uses dedicated DMA …
Figure 2: Empirical cumulative distribution function (CDF) of inference latency for GPU and CPU paths
Figure 3: STER vs. mean latency for all experimental conditions. Both GPU and CPU paths maintain …

Original abstract

A system can satisfy accuracy-based validation, maintain output stability (Safety-Threshold Exceedance Rate, STER, equal to zero), and still violate timing constraints under deployment load. These are structurally independent properties that current pre-market validation protocols often do not operationalize at the inference layer. This letter demonstrates their independence through a controlled same-hardware experiment: identical MobileNetV2 models are evaluated under identical adversarial load on two execution paths of the same NVIDIA Jetson Orin Nano Super, a dedicated GPU accelerator (TensorRT FP16, half-precision floating point) and a general-purpose CPU (ONNX Runtime FP32, single-precision floating point). Both paths maintain STER = 0; the CPU path (ONNX Runtime FP32) degrades 7.2x under combined load (mean latency 9.8x higher than the GPU path (TensorRT FP16), which maintains latency below 11 ms), breaching the 10 Hz clinical cycle budget by 65%. Joint STER and latency verification is proposed as a candidate method for operationalizing U.S. FDA Draft Guidance FDA-2024-D-4488 robustness requirements at the inference layer, subject to regulatory review and clinical validation.
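The joint verification proposed in the abstract can be pictured as a two-axis gate in which stability and timing are judged separately and then combined. This is a minimal sketch under stated assumptions: the abstract does not say which latency statistic the gate should use, so the worst observed latency is assumed here, and the names `Verdict` and `joint_verdict` are hypothetical.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "stable and on time"
    TIMING_FAIL = "stable but over budget"
    STABILITY_FAIL = "on time but unstable"
    BOTH_FAIL = "unstable and over budget"


def joint_verdict(ster, worst_latency_ms, budget_ms=100.0):
    """Combine an output-stability result and a timing result into one verdict.

    ster: Safety-Threshold Exceedance Rate in [0, 1]; zero counts as stable.
    worst_latency_ms: assumed gate statistic (worst observed latency).
    budget_ms: per-cycle budget; 100 ms corresponds to a 10 Hz cycle.
    """
    stable = ster == 0.0
    on_time = worst_latency_ms <= budget_ms
    if stable and on_time:
        return Verdict.PASS
    if stable:
        return Verdict.TIMING_FAIL
    if on_time:
        return Verdict.STABILITY_FAIL
    return Verdict.BOTH_FAIL


# Illustrative inputs only (assumed values, not the paper's measurements).
print(joint_verdict(0.0, 10.8))   # a path that is stable and within budget
print(joint_verdict(0.0, 140.0))  # a path that is stable but too slow
```

Keeping the four outcomes distinct makes the independence argument visible in the result: a stable path that misses the budget is reported as a timing failure instead of being folded into a single pass or fail flag.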

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that accuracy-based validation and zero Safety-Threshold Exceedance Rate (STER) do not guarantee satisfaction of timing constraints under deployment load in edge AI medical devices. It demonstrates this independence via a controlled same-silicon experiment on an NVIDIA Jetson Orin Nano using identical MobileNetV2 models: the TensorRT FP16 GPU path maintains latency below 11 ms (meeting the 10 Hz budget), while the ONNX Runtime FP32 CPU path degrades 7.2x in latency (9.8x higher mean, breaching the budget by 65%), yet both paths achieve STER=0. The work proposes joint STER and latency verification to operationalize FDA robustness guidance at the inference layer.

Significance. If the reported divergence holds under fuller methodological scrutiny, the result supplies a clear existence proof that functional stability and timing safety are separable properties, directly relevant to architectural isolation techniques for shared-silicon edge platforms. The same-hardware, same-load design is a strength that minimizes confounding variables and provides falsifiable, quantitative evidence (specific degradation factors and breach percentage) that could inform pre-market validation protocols.

major comments (2)
  1. [Abstract / Experimental Results] Abstract and experimental description: the central quantitative claims (7.2x degradation, 9.8x mean latency, 65% breach) are presented without error bars, number of trials, load-generation parameters, or any statistical tests. These omissions are load-bearing because the independence claim rests on the reliability of the observed timing difference between the two paths.
  2. [Methods] Methods: insufficient detail is given on how the adversarial load was constructed and applied identically to both execution paths, and on the precise definition and measurement protocol for STER=0. Without these, it is not possible to replicate or assess whether the CPU-path violation is robust or an artifact of the specific setup.
minor comments (2)
  1. [Abstract] Clarify whether the 10 Hz budget is a hard clinical requirement or a chosen threshold, and state the exact latency target used for the breach calculation.
  2. [Discussion] The title emphasizes 'Architectural Isolation as a Timing Safety Primitive'; the manuscript would benefit from a short paragraph explicitly linking the observed CPU/GPU divergence to isolation mechanisms rather than leaving it implicit.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving experimental transparency and replicability, and we have revised the paper accordingly to address them directly.

  1. Referee: [Abstract / Experimental Results] Abstract and experimental description: the central quantitative claims (7.2x degradation, 9.8x mean latency, 65% breach) are presented without error bars, number of trials, load-generation parameters, or any statistical tests. These omissions are load-bearing because the independence claim rests on the reliability of the observed timing difference between the two paths.

    Authors: We agree that these statistical and methodological details are essential to substantiate the quantitative claims and the independence result. In the revised manuscript we have expanded the experimental results section to report the number of trials (1,000 independent inferences per execution path and load condition), error bars as standard deviation, the load-generation parameters (fixed concurrent CPU- and memory-bound processes launched via the same script), and a statistical comparison (Wilcoxon rank-sum test, p < 0.001) confirming the latency difference between paths. These additions directly support the reliability of the observed divergence while preserving the original quantitative findings.

    Revision: yes

  2. Referee: [Methods] Methods: insufficient detail is given on how the adversarial load was constructed and applied identically to both execution paths, and on the precise definition and measurement protocol for STER=0. Without these, it is not possible to replicate or assess whether the CPU-path violation is robust or an artifact of the specific setup.

    Authors: We have substantially expanded the Methods section with a new subsection on experimental controls. The adversarial load was constructed from a fixed suite of background processes (matrix multiplications and I/O operations) executed concurrently on the shared SoC; identical load scripts and process priorities were used for both the TensorRT GPU and ONNX Runtime CPU paths to guarantee equivalent contention. STER is defined as the fraction of inferences in which the model’s predicted probability for the ground-truth class falls below a pre-specified clinical safety threshold (0.95); it was measured by logging every inference output against held-out validation labels and computing the exceedance rate over the full trial window. Both paths yielded STER = 0 under these conditions. The added protocol enables direct replication and robustness checks.

    Revision: yes
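The two measurements named in the simulated responses above, the STER computation and the Wilcoxon rank-sum comparison of latencies, can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the rank-sum test uses the normal approximation without a tie correction, the function names `ster` and `rank_sum_test` are hypothetical, and all sample values are invented.

```python
import math


def ster(true_class_probs, threshold=0.95):
    """Fraction of inferences whose ground-truth-class probability is below threshold."""
    if not true_class_probs:
        raise ValueError("need at least one inference")
    below = sum(1 for p in true_class_probs if p < threshold)
    return below / len(true_class_probs)


def rank_sum_test(x, y):
    """Wilcoxon rank-sum z statistic and two-sided p-value.

    Assumption: normal approximation, average ranks for ties, no tie correction.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    mean_w = n1 * (n1 + n2 + 1) / 2.0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (rank_sum_x - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail of the standard normal
    return z, p


# Illustrative values only (assumed, not the paper's data).
probs = [0.99, 0.97, 0.96, 0.98]
gpu_ms = [1.0, 2.0, 3.0, 4.0, 5.0]
cpu_ms = [6.0, 7.0, 8.0, 9.0, 10.0]

print(ster(probs))
z, p = rank_sum_test(gpu_ms, cpu_ms)
print(z, p)
```

A negative z here means the first sample ranks lower, that is, the first path is faster. With only five samples per group the normal approximation is rough; the 1,000 inferences per condition described in the response would be a better fit for it.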

Circularity Check

0 steps flagged

No significant circularity

The paper is a purely experimental report of controlled measurements on the same Jetson Orin Nano hardware: identical MobileNetV2 models run under identical adversarial load on GPU (TensorRT FP16) versus CPU (ONNX Runtime FP32) paths, with STER=0 on both but CPU latency violating the 10 Hz budget. No equations, fitted parameters, derivations, or self-citations appear in the provided text. The central claim of structural independence between accuracy validation, STER=0, and timing constraints is established directly by the existence proof of divergent outcomes on the reported data, with no reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the experimental setup being representative of medical deployment and on STER serving as a sufficient proxy for output safety; no free parameters or invented entities are introduced.

axioms (2)
  • Domain assumption: the chosen combined adversarial load on the Jetson Orin Nano represents realistic deployment conditions for edge AI medical devices.
    Invoked to produce the observed latency degradation while STER remains zero.
  • Domain assumption: a zero Safety-Threshold Exceedance Rate indicates output stability adequate for clinical safety.
    Used to assert that the CPU path remains safe despite the timing violation.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

  1. FDA, “Artificial Intelligence-Enabled Device Software Functions: Lifecycle Management and Marketing Submission Recommendations. Draft Guidance,” Docket FDA-2024-D-4488, Jan. 2025.
  2. J. Hao, P. Subedi, L. Ramaswamy, and I. K. Kim, “Reaching for the Sky: Maximizing Deep Learning Inference Throughput on Edge Devices with AI Multi-Tenancy,” ACM Trans. Internet Technol., vol. 23, no. 1, pp. 1–33, Feb. 2023.
  3. J. J. Martín, J. Flich, and C. Hernández, “Performance Isolation for Inference Processes in Edge GPU Systems,” arXiv:2601.07600, Jan. 2026.
  4. S. P. Baller, A. Jindal, M. Chadha, and M. Gerndt, “DeepEdgeBench: Benchmarking Deep Neural Networks on Edge Devices,” in Proc. IEEE IC2E, 2021, pp. 20–30.
  5. M. Tobiasz et al., “Edge Devices Inference Performance Comparison,” arXiv:2306.12093, Jun. 2023.
  6. B. A. Becker, “Increasing Safety of Neural Networks in Medical Devices,” in Proc. SAFECOMP Workshops, LNCS vol. 11699, Springer, 2019, pp. 91–101.
  7. R. Wilhelm et al., “The Worst-Case Execution Time Problem—Overview of Methods and Survey of Tools,” ACM Trans. Embed. Comput. Syst., vol. 7, no. 3, pp. 36:1–36:53, Apr. 2008.
  8. IEC 62304:2006+AMD1:2015, “Medical device software: Software life cycle processes,” IEC, Geneva, 2015.
  9. ISO 14971:2019, “Medical devices: Application of risk management to medical devices,” ISO, Geneva, 2019.
  10. C. I. King, “stress-ng: Tool to Load and Stress a Computer System,” GitHub, 2023. [Online]. Available: https://github.com/ColinIanKing/stress-ng
  11. M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted Residuals and Linear Bottlenecks,” in Proc. IEEE/CVF CVPR, 2018, pp. 4510–4520.
  12. NVIDIA Corporation, “TensorRT Developer Guide,” NVIDIA Developer Documentation, 2024. [Online]. Available: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/
  13. Microsoft Corporation, “ONNX Runtime: Cross-Platform Inference Accelerator,” GitHub, 2024. [Online]. Available: https://github.com/microsoft/onnxruntime
  14. B. Lee et al., “Early Recalls and Clinical Validation Gaps in Artificial Intelligence-Enabled Medical Devices,” JAMA Health Forum, vol. 6, no. 8, p. e253172, Aug. 2025.