arxiv: 2606.26492 · v1 · pith:XZI7QYBFnew · submitted 2026-06-25 · 💻 cs.SE · cs.AI · cs.LG

Evaluation-Strategy Gap in Fault Diagnosis of Deep Learning Programs

Pith reviewed 2026-06-26 04:42 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.LG
keywords fault diagnosis · deep learning programs · evaluation gap · within-program cross-validation · program hold-out · curvature features · DynFault corpus · balanced accuracy

The pith

Fault diagnosis techniques for deep learning programs drop 0.19 in balanced accuracy on entirely new programs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures how fault diagnosis methods for DL training failures perform when tested on programs never seen during development. Standard within-program cross-validation can overestimate accuracy because features often encode program-specific patterns. Using the DynFault corpus of 5542 injected-fault traces from 38 real DL programs, the authors quantify a 0.19 balanced-accuracy gap between the two evaluation strategies. The gap traces to program-level structure in the runtime features. Curvature features retain diagnostic value on unseen programs while optimizer and activation features lose effectiveness outside the training set.

Core claim

Existing fault diagnosis techniques for DL programs exhibit a gap of 0.190 in balanced accuracy when moving from within-program evaluation to holding out whole programs. The gap stems from program-level structure in the features. Curvature features remain useful for instability detection on unseen programs, whereas optimizer and activation features are effective only on programs seen during training.
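The gap in the core claim is a difference between two balanced-accuracy scores. A minimal sketch of that arithmetic follows, assuming the standard definition of balanced accuracy (the mean of per-class recall). The labels and predictions are toy values invented for illustration and are not taken from the paper.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each fault class counts equally."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Toy fault labels and predictions (hypothetical, for illustration only).
y_true       = ["lr", "lr", "lr", "lr", "act", "act"]
within_pred  = ["lr", "lr", "lr", "lr", "act", "lr"]   # within-program CV
holdout_pred = ["lr", "lr", "lr", "act", "lr", "lr"]   # program hold-out

within = balanced_accuracy(y_true, within_pred)
holdout = balanced_accuracy(y_true, holdout_pred)
gap = within - holdout
print(within, holdout, gap)
```

Because every class contributes one recall term, a rare fault class that is always missed pulls the score down as much as a common one would.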

What carries the argument

Program hold-out evaluation, which exposes the generalization failure caused by program-level structure in runtime trace features.
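The difference between the two evaluation strategies comes down to how traces are assigned to folds. The sketch below is illustrative only: the fold count, program names, and helper functions are assumptions, not the paper's protocol. It contrasts a split that spreads each program across all folds with one that keeps whole programs together, then counts how many test programs also appear in training.

```python
import random
from collections import defaultdict

def within_program_folds(programs, n_folds, seed=0):
    """Spread each program's traces over all folds (programs recur in train and test)."""
    rng = random.Random(seed)
    by_prog = defaultdict(list)
    for i, p in enumerate(programs):
        by_prog[p].append(i)
    folds = [[] for _ in range(n_folds)]
    for idxs in by_prog.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % n_folds].append(i)
    return folds

def program_holdout_folds(programs, n_folds, seed=0):
    """Assign whole programs to folds, so test programs are unseen in training."""
    rng = random.Random(seed)
    names = sorted(set(programs))
    rng.shuffle(names)
    fold_of = {p: j % n_folds for j, p in enumerate(names)}
    folds = [[] for _ in range(n_folds)]
    for i, p in enumerate(programs):
        folds[fold_of[p]].append(i)
    return folds

def leaked_programs(programs, folds):
    """Per test fold, count programs that also occur in the training folds."""
    counts = []
    for k, test in enumerate(folds):
        test_progs = {programs[i] for i in test}
        train_progs = {programs[i] for j, f in enumerate(folds) if j != k for i in f}
        counts.append(len(test_progs & train_progs))
    return counts

# Hypothetical corpus: 6 programs with 10 traces each.
programs = [f"prog{p}" for p in range(6) for _ in range(10)]
within_leak = leaked_programs(programs, within_program_folds(programs, 3))
holdout_leak = leaked_programs(programs, program_holdout_folds(programs, 3))
print(within_leak)   # every test fold shares all 6 programs with training
print(holdout_leak)  # no test program appears in training
```

With a library such as scikit-learn, the second split corresponds to a group-aware splitter keyed on program identity; the plain-Python version here only avoids the dependency.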

If this is right

  • Diagnosis systems must be re-evaluated with program hold-out to reflect deployment conditions.
  • Curvature features can be used directly for instability detection in previously unseen programs.
  • Optimizer and activation features require retraining or adaptation when a new program is introduced.
  • Feature selection for fault diagnosis should prioritize signals that avoid encoding program identity.
  • Benchmarks relying only on within-program splits systematically overestimate reliability on new codebases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams maintaining diagnosis tools for evolving DL code may need to gather fresh labeled traces for each major new program.
  • Combining curvature features with program-agnostic reformulations of other features could narrow the observed gap.
  • The gap implies that production monitoring of novel DL training runs carries higher uncertainty than current published numbers suggest.
  • Extending the analysis to programs written in additional frameworks or domains would test whether the gap size is stable.

Load-bearing premise

The DynFault corpus of 5542 fault-injected traces from 38 real-world DL programs is representative of real deployment failures and program structures.

What would settle it

Re-running the same techniques on a new, independently collected set of DL programs and observing no accuracy difference between within-program and hold-out evaluation would falsify the claimed gap.

Figures

Figures reproduced from arXiv: 2606.26492 by Sigma Jahan.

Figure 1
Figure 1: Study design overview. This threshold and computation rule originate from the DEFault corpus construction protocol [8] and are applied identically (see our replication package [17]). To isolate mismatch effects from catastrophic failures, we restrict this task to stable traces only (i.e., no loss explosion in any epoch). C. Feature Extraction. For each scalar metric channel ze observed over the first k logge…
Figure 2
Figure 2: Evaluation-strategy gap across observation windows
Figure 3
Figure 3: Program-level structure controls at k=5. (b) Within-program label permutation. Fig. 3b shows that permuting fault labels within each program keeps within-program CV accuracy well above chance. Across 25 permutations, median permuted accuracy is 0.41, versus a 0.17 chance baseline (6-class uniform) and 0.47 for the unpermuted case. We found that a classifier achieves accuracy well above chance on within-pro…
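The within-program label permutation control mentioned in the Figure 3 caption can be sketched as follows. This is a hedged illustration, not the authors' code: the function name, program names, and fault labels are hypothetical. Shuffling labels only among traces of the same program breaks the link between a trace's features and its fault while preserving each program's label mix, so any accuracy that stays above chance has to come from program-level structure.

```python
import random
from collections import Counter, defaultdict

def permute_labels_within_programs(programs, labels, seed=0):
    """Shuffle fault labels among traces of the same program only."""
    rng = random.Random(seed)
    by_prog = defaultdict(list)
    for i, p in enumerate(programs):
        by_prog[p].append(i)
    permuted = list(labels)
    for idxs in by_prog.values():
        shuffled = [labels[i] for i in idxs]
        rng.shuffle(shuffled)
        for i, lab in zip(idxs, shuffled):
            permuted[i] = lab
    return permuted

# Hypothetical traces: two programs with different fault-label mixes.
programs = ["progA"] * 4 + ["progB"] * 4
labels = ["lr", "lr", "lr", "act", "loss", "loss", "act", "act"]
permuted = permute_labels_within_programs(programs, labels, seed=1)

# Uniform chance level for 6 fault classes, matching the 0.17 baseline in the caption.
chance = 1 / 6
print(permuted, round(chance, 2))
```

Repeating the shuffle with different seeds (the caption reports 25 permutations) and retraining each time yields the distribution of permuted accuracies to compare against the chance level.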
Figure 5
Figure 5: Instability characterization. (a) 96% of unstable runs explode at epoch 0
Figure 6
Figure 6: Instability rate by epoch-0 quintile. Adjacent plot text: x-axis, normalized layer depth; y-axis, log10(|HVP norm|) (epoch 0, resampled); legend, Stable (n = 3458) and Unstable (n = 179).
Figure 7
Figure 7: Layer-wise HVP magnitude at epoch 0. Curvature-based triage rule. We also examined whether epoch-0 curvature geometry can group runs into interpretable categories before training proceeds. We cluster the 3,637
TABLE VII: Curvature cluster patterns
Cluster   n      Programs   Unstable (%)   Mismatch (%)   Median HVP    Median ∇-std
0         1346   19         0.5            49.1           6.1 × 10^-1   1.4 × 10^-2
1         121    9          100.0          3.3            5.4 × 10^6    6.8 × 10^5
2         367    8          0.8            28.…
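For readers unfamiliar with the HVP (Hessian-vector product) quantity behind these curvature features, the sketch below shows one simple way to approximate it. It is an assumption-laden illustration: it uses a central finite difference of gradients on a toy quadratic loss rather than the exact method of Pearlmutter [23], and this page does not say how the paper computes its HVP features. The matrix `A`, the point `w`, and the direction `v` are made-up values.

```python
import math

def grad(w, A):
    """Gradient of the toy loss f(w) = 0.5 * w^T A w, which equals A w for symmetric A."""
    n = len(w)
    return [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]

def hvp_finite_difference(grad_fn, w, v, eps=1e-4):
    """Approximate H v with a central difference of gradients along direction v."""
    plus = grad_fn([wi + eps * vi for wi, vi in zip(w, v)])
    minus = grad_fn([wi - eps * vi for wi, vi in zip(w, v)])
    return [(p - m) / (2 * eps) for p, m in zip(plus, minus)]

# Toy symmetric Hessian, parameter point, and probe direction (hypothetical values).
A = [[2.0, 0.5], [0.5, 1.0]]
w = [1.0, -1.0]
v = [1.0, 0.0]

hv = hvp_finite_difference(lambda x: grad(x, A), w, v)
hvp_norm = math.sqrt(sum(x * x for x in hv))
log_hvp_norm = math.log10(hvp_norm)  # same scale as the log10(|HVP norm|) axis
print(hv, log_hvp_norm)
```

For a quadratic loss the Hessian is `A` itself, so the result should match `A v` up to floating-point error, which makes the toy case easy to check by hand.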
Figure 9
Figure 9: Instability precision–recall curves under program-held-out evaluation
Figure 11
Figure 11: Optimizer-feature logistic regression coefficients across within
Figure 10
Figure 10: T/V-mismatch ROC-AUC across observation windows
Figure 13
Figure 13: Within- vs. cross-program accuracy for DEFault and DeepFD at
Original abstract

Deep Learning (DL) programs can fail during training for many reasons, and diagnosing the cause is a costly and time-consuming maintenance task. Techniques for diagnosing such failures are commonly assessed using within-program cross-validation, which may be inadequate for deployment settings involving previously unseen programs. It is therefore necessary to assess how performance differs across these settings and to identify the causes of any performance gap in established fault diagnosis techniques for DL. We investigate this gap using DynFault, a corpus of 5,542 fault-injected training traces from 38 real-world DL programs. We found a gap of 0.190 in balanced accuracy for existing fault diagnosis techniques between within-program evaluation and holding out whole programs. We also found the gap comes from program-level structure in the features, which led us to examine two runtime feature sets, curvature features and optimizer features, and their behavior on unseen programs. We found that curvature features are useful for instability detection on unseen programs, while optimizer and activation features help only on programs seen during training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that fault diagnosis techniques for deep learning programs exhibit a 0.190 gap in balanced accuracy between within-program cross-validation and whole-program hold-out evaluation. This gap is measured on the DynFault corpus of 5,542 fault-injected training traces from 38 real-world DL programs and is attributed to program-level structure in the features. The authors further examine curvature, optimizer, and activation feature sets, concluding that curvature features remain useful for instability detection on unseen programs while optimizer and activation features help only on programs seen during training.

Significance. If the result holds, the work provides a concrete empirical demonstration that standard within-program evaluation overestimates performance for DL fault diagnosis in deployment settings involving new programs. The distinction between feature sets (curvature vs. optimizer/activation) offers an actionable direction for designing more generalizable diagnosis methods. The scale of the corpus (38 programs, >5k traces) supplies a measurable baseline that future studies can build upon or replicate.

major comments (3)
  1. [Abstract / DynFault corpus description] Abstract and methods description of DynFault: the reported 0.190 balanced-accuracy gap and its attribution to program-level feature structure rest on the corpus of 5,542 injected faults, yet no details are supplied on fault-injection procedure, program selection criteria, trace collection protocol, or any statistical controls for program identity. Without these, it is impossible to determine whether the gap reflects genuine program-level structure or an artifact of the injection process.
  2. [Abstract] Abstract: the central explanatory claim that 'the gap comes from program-level structure in the features' is presented without supporting quantitative evidence (e.g., feature-distribution statistics across programs, ablation of program identity, or cross-program variance decomposition). This attribution is load-bearing for the subsequent feature-set analysis.
  3. [Abstract] Abstract: the evaluation is performed exclusively on artificially injected faults; no comparison against a set of naturally occurring (non-injected) DL training failures is reported. Consequently, both the magnitude of the gap and the differential utility of curvature features on unseen programs rest on an untested assumption that injected faults produce representative runtime traces.
minor comments (1)
  1. [Abstract] The abstract states a precise numerical gap (0.190) but does not indicate whether this is an average across multiple techniques or a single aggregate; clarifying the exact aggregation would improve readability.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive comments, which highlight important aspects of clarity and evidence in our work on the evaluation-strategy gap for DL fault diagnosis. We address each major comment below with specific plans for revision where appropriate.

  1. Referee: [Abstract / DynFault corpus description] Abstract and methods description of DynFault: the reported 0.190 balanced-accuracy gap and its attribution to program-level feature structure rest on the corpus of 5,542 injected faults, yet no details are supplied on fault-injection procedure, program selection criteria, trace collection protocol, or any statistical controls for program identity. Without these, it is impossible to determine whether the gap reflects genuine program-level structure or an artifact of the injection process.

    Authors: We agree that the abstract lacks sufficient detail on corpus construction. The full manuscript describes the DynFault corpus in Section 3, including program selection from popular open-source DL repositories, fault types injected via targeted code mutations, and trace collection using runtime instrumentation. To address the concern directly, we will revise the abstract to include a concise summary of these elements and add explicit statistical controls (e.g., program-stratified sampling and identity ablation checks) in the methods section of the revised manuscript. revision: yes

  2. Referee: [Abstract] Abstract: the central explanatory claim that 'the gap comes from program-level structure in the features' is presented without supporting quantitative evidence (e.g., feature-distribution statistics across programs, ablation of program identity, or cross-program variance decomposition). This attribution is load-bearing for the subsequent feature-set analysis.

    Authors: We acknowledge that the abstract presents the attribution concisely without inline quantitative support. The manuscript body includes supporting analyses (feature distributions and cross-program comparisons), but to strengthen the load-bearing claim, we will incorporate additional quantitative evidence such as per-program feature distribution statistics, program-identity ablation results, and variance decomposition in a new subsection of the revised paper. revision: yes

  3. Referee: [Abstract] Abstract: the evaluation is performed exclusively on artificially injected faults; no comparison against a set of naturally occurring (non-injected) DL training failures is reported. Consequently, both the magnitude of the gap and the differential utility of curvature features on unseen programs rest on an untested assumption that injected faults produce representative runtime traces.

    Authors: The work deliberately employs controlled fault injection to enable reproducible, large-scale evaluation across 38 programs while isolating specific fault effects. We recognize that this leaves the representativeness of injected faults as an assumption. In the revision, we will add an explicit limitations subsection discussing this assumption, its rationale, and implications for deployment on natural failures, without claiming equivalence. revision: partial

standing simulated objections not resolved
  • Direct empirical comparison of injected versus naturally occurring DL training failures, as no such natural-fault corpus was collected or available for this study.

Circularity Check

0 steps flagged

No circularity: purely empirical measurement study

The paper reports direct empirical measurements of balanced accuracy gaps on a fixed corpus of 5542 injected-fault traces across 38 programs. No derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described claims. The 0.190 gap is a computed statistic on held-out programs, not a quantity forced by construction from the inputs. This matches the default case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the representativeness of the fault-injected corpus and the validity of cross-program hold-out as a proxy for deployment generalization.

axioms (1)
  • domain assumption: Fault injection into training traces produces failures representative of natural DL program faults
    The entire evaluation rests on the DynFault corpus built via fault injection.

Reference graph

Works this paper leans on

39 extracted references · 24 canonical work pages

  1. [1]

    A comprehensive study on deep learning bug characteristics,

    M. J. Islam, G. Nguyen, R. Pan, and H. Rajan, “A comprehensive study on deep learning bug characteristics,” in Proceedings of the 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), 2019, pp. 510–520. doi: 10.1145/3338906.3338955

  2. [2]

    Taxonomy of real faults in deep learning systems,

    N. Humbatova et al., “Taxonomy of real faults in deep learning systems,” in Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (ICSE), 2020, pp. 1110–1121. doi: 10.1145/3377811.3380395

  3. [3]

    Visualizing the loss landscape of neural nets,

    H. Li et al., “Visualizing the loss landscape of neural nets,” in Advances in Neural Information Processing Systems (NeurIPS), 2018

  4. [4]

    An empirical study on program failures of deep learning jobs,

    R. Zhang et al., “An empirical study on program failures of deep learning jobs,” in Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (ICSE), 2020, pp. 1159–1170. doi: 10.1145/3377811.3380362

  5. [5]

    Repairing deep neural networks: Fix patterns and challenges,

    M. J. Islam, R. Pan, G. Nguyen, and H. Rajan, “Repairing deep neural networks: Fix patterns and challenges,” in Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (ICSE), 2020, pp. 1135–1146. doi: 10.1145/3377811.3380378

  6. [6]

    Deepdiagnosis: Automatically diagnosing faults and recommending actionable fixes in deep learning programs,

    M. Wardat, B. D. Cruz, W. Le, and H. Rajan, “Deepdiagnosis: Automatically diagnosing faults and recommending actionable fixes in deep learning programs,” in Proceedings of the 44th International Conference on Software Engineering (ICSE), 2022, pp. 561–572. doi: 10.1145/3510003.3510071

  7. [7]

    Deepfd: Automated fault diagnosis and localization for deep learning programs,

    J. Cao, Y. Lu, M. Wen, and S. Cheung, “Deepfd: Automated fault diagnosis and localization for deep learning programs,” in Proceedings of the 44th International Conference on Software Engineering (ICSE), 2022, pp. 573–585. doi: 10.1145/3510003.3510099

  8. [8]

    Improved detection and diagnosis of faults in deep neural networks using hierarchical and explainable classification,

    S. Jahan, M. B. Shah, P. Mahbub, and M. M. Rahman, “Improved detection and diagnosis of faults in deep neural networks using hierarchical and explainable classification,” in Proceedings of the IEEE/ACM 47th International Conference on Software Engineering (ICSE), 2025, pp. 2944–2956. doi: 10.1109/ICSE55347.2025.00224

  9. [9]

    Coverage-enhanced fault diagnosis for deep learning programs: A learning-based approach with hybrid metrics,

    X. Qi, T. Zhu, and Y. Li, “Coverage-enhanced fault diagnosis for deep learning programs: A learning-based approach with hybrid metrics,” Information and Software Technology, vol. 173, p. 107488, 2024. doi: 10.1016/j.infsof.2024.107488

  10. [10]

    Shortcut learning in deep neural networks,

    R. Geirhos et al., “Shortcut learning in deep neural networks,” Nature Machine Intelligence, vol. 2, pp. 665–673, 2020. doi: 10.1038/s42256-020-00257-z

  11. [11]

    Cross-project defect prediction: A large scale experiment on data vs. domain vs. process,

    T. Zimmermann et al., “Cross-project defect prediction: A large scale experiment on data vs. domain vs. process,” in Proceedings of the Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, 2009, pp. 91–100. doi: 10.1145/1595696.1595713

  12. [12]

    An empirical comparison of model validation techniques for defect prediction models,

    C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and K. Matsumoto, “An empirical comparison of model validation techniques for defect prediction models,” IEEE Transactions on Software Engineering, vol. 43, no. 1, pp. 1–18, 2017. doi: 10.1109/TSE.2016.2584050

  13. [13]

    Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure,

    D. R. Roberts et al., “Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure,” Ecography, vol. 40, no. 8, pp. 913–929, 2017. doi: 10.1111/ecog.02881

  14. [14]

    An empirical study of the impact of data splitting decisions on the performance of AIOps solutions,

    Y. Lyu et al., “An empirical study of the impact of data splitting decisions on the performance of AIOps solutions,” ACM Transactions on Software Engineering and Methodology, vol. 30, no. 4, pp. 1–38, 2021. doi: 10.1145/3447876

  15. [15]

    AutoTrainer: An automatic DNN training problem detection and repair system,

    X. Zhang, J. Zhai, S. Ma, and C. Shen, “AutoTrainer: An automatic DNN training problem detection and repair system,” in Proceedings of the IEEE/ACM 43rd International Conference on Software Engineering (ICSE), 2021, pp. 359–371. doi: 10.1109/ICSE43902.2021.00043

  16. [16]

    UMLAUT: Debugging deep learning programs using program structure and model behavior,

    E. Schoop, F. Huang, and B. Hartmann, “UMLAUT: Debugging deep learning programs using program structure and model behavior,” in Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (CHI), 2021. doi: 10.1145/3411764.3445538

  17. [17]

    Replication Package for the Evaluation Strategy Gap Study,

    S. Jahan, Replication Package for the Evaluation Strategy Gap Study, https://github.com/SigmaJahan/Evaluation-Strategy-Gap-Study, Accessed: 2026-06-24, 2026

  18. [18]

    A comprehensive empirical study on bug characteristics of deep learning frameworks,

    Y. Yang, T. He, Z. Xia, and Y. Feng, “A comprehensive empirical study on bug characteristics of deep learning frameworks,” Information and Software Technology, p. 107004, 2022. doi: 10.1016/j.infsof.2022.107004

  19. [19]

    Deepcrime: Mutation testing of deep learning systems based on real faults,

    N. Humbatova, G. Jahangirova, and P. Tonella, “Deepcrime: Mutation testing of deep learning systems based on real faults,” in Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), 2021, pp. 67–78. doi: 10.1145/3460319.3464825

  20. [20]

    Generalized linear models,

    J. A. Nelder and R. W. Wedderburn, “Generalized linear models,” Journal of the Royal Statistical Society Series A: Statistics in Society, vol. 135, no. 3, pp. 370–384, 1972

  21. [21]

    An investigation into neural net optimization via hessian eigenvalue density,

    B. Ghorbani, S. Krishnan, and Y. Xiao, “An investigation into neural net optimization via hessian eigenvalue density,” in Proceedings of the International Conference on Machine Learning (ICML), 2019, pp. 2232–2241

  22. [22]

    Can hessian-based insights support fault diagnosis in attention-based models?

    S. Jahan and M. M. Rahman, “Can hessian-based insights support fault diagnosis in attention-based models?” in Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering (FSE), New York, NY, USA: Association for Computing Machinery, 2025, pp. 676–680. doi: 10.1145/3696630.3728522

  23. [23]

    Fast exact multiplication by the hessian,

    B. A. Pearlmutter, “Fast exact multiplication by the hessian,” Neural Computation, vol. 6, no. 1, pp. 147–160, 1994. doi: 10.1162/neco.1994.6.1.147

  24. [24]

    Deep learning via hessian-free optimization,

    J. Martens, “Deep learning via hessian-free optimization,” in Proceedings of the 27th International Conference on Machine Learning (ICML), 2010, pp. 735–742

  25. [25]

    Gradient descent on neural networks typically occurs at the edge of stability,

    J. M. Cohen et al., “Gradient descent on neural networks typically occurs at the edge of stability,” arXiv preprint arXiv:2103.00065, 2021

  26. [26]

    Understanding gradient descent on the edge of stability in deep learning,

    S. Arora, Z. Li, and A. Panigrahi, “Understanding gradient descent on the edge of stability in deep learning,” in Proceedings of the 39th International Conference on Machine Learning (ICML), PMLR, 2022, pp. 948–1024

  27. [27]

    A loss curvature perspective on training instability in deep learning,

    J. Gilmer et al., “A loss curvature perspective on training instability in deep learning,” arXiv preprint arXiv:2110.04369, 2021

  28. [28]

    Cockpit: A practical debugging tool for the training of deep neural networks,

    F. Schneider, F. Dangel, and P. Hennig, “Cockpit: A practical debugging tool for the training of deep neural networks,” in Advances in Neural Information Processing Systems (NeurIPS), 2021

  29. [29]

    Bootstrap-based improvements for inference with clustered errors,

    A. C. Cameron, J. B. Gelbach, and D. L. Miller, “Bootstrap-based improvements for inference with clustered errors,” The Review of Economics and Statistics, vol. 90, no. 3, pp. 414–427, 2008. doi: 10.1162/rest.90.3.414

  30. [30]

    A kernel two-sample test,

    A. Gretton et al., “A kernel two-sample test,” Journal of Machine Learning Research, vol. 13, no. 25, pp. 723–773, 2012

  31. [31]

    Hidden technical debt in machine learning systems,

    D. Sculley et al., “Hidden technical debt in machine learning systems,” in Advances in Neural Information Processing Systems, 2015, pp. 2503–2511

  32. [32]

    An empirical study on TensorFlow program bugs,

    Y. Zhang et al., “An empirical study on TensorFlow program bugs,” in Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), 2018, pp. 129–140. doi: 10.1145/3213846.3213866

  33. [33]

    DeepLocalize: Fault localization for deep neural networks,

    M. Wardat, W. Le, and H. Rajan, “DeepLocalize: Fault localization for deep neural networks,” in Proceedings of the IEEE/ACM 43rd International Conference on Software Engineering (ICSE), 2021, pp. 251–262. doi: 10.1109/ICSE43902.2021.00034

  34. [34]

    Detecting numerical bugs in neural network architectures,

    Y. Zhang et al., “Detecting numerical bugs in neural network architectures,” in Proceedings of the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), 2020. doi: 10.1145/3368089.3409720

  35. [35]

    Design by contract for deep learning apis,

    S. Ahmed et al., “Design by contract for deep learning apis,” in Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), 2023. doi: 10.1145/3611643.3616247

  36. [36]

    Deepmutation: Mutation testing of deep learning systems,

    L. Ma et al., “Deepmutation: Mutation testing of deep learning systems,” in Proceedings of the IEEE 29th International Symposium on Software Reliability Engineering (ISSRE), 2018, pp. 100–111. doi: 10.1109/ISSRE.2018.00021

  37. [37]

    An empirical study of the realism of mutants in deep learning,

    Z. Ahmed, P. Makedonski, and J. Grabowski, “An empirical study of the realism of mutants in deep learning,” arXiv preprint arXiv:2512.16741, 2025

  38. [38]

    A comparative study to benchmark cross-project defect prediction approaches,

    S. Herbold, A. Trautsch, and J. Grabowski, “A comparative study to benchmark cross-project defect prediction approaches,” IEEE Transactions on Software Engineering, vol. 44, no. 9, pp. 811–833, 2018. doi: 10.1109/TSE.2017.2724538

  39. [39]

    Software engineering for machine learning: A case study,

    S. Amershi et al., “Software engineering for machine learning: A case study,” in 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE-SEIP), 2019. doi: 10.1109/ICSE-SEIP.2019.00042