
arxiv: 1907.06632 · v1 · pith:7R5V4KE4new · submitted 2019-07-13 · 💻 cs.LG · cs.SE

Metamorphic Testing of a Deep Learning based Forecaster

Pith reviewed 2026-05-24 21:53 UTC · model grok-4.3

classification 💻 cs.LG cs.SE
keywords metamorphic testing · deep learning · LSTM · forecasting · software testing · mutation testing · correlation detection · outage prediction

The pith

Metamorphic testing with 19 relations uncovers 8 unknown issues in a live deep learning outage forecaster and catches 65.9 percent of mutated bugs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper applies metamorphic testing to a deep learning forecasting application that predicts future system outages from past characteristics such as memory allocation. It defines 19 metamorphic relations for two components: statistical detection of correlations between system variables and LSTM-based prediction of future values. The relations allow verification of behavior under input transformations without needing an exact expected output for each test case. When run on the production application the relations surfaced 8 previously undetected issues. Mutation testing on a reference LSTM implementation showed the relations detected 65.9 percent of the injected faults.

Core claim

The paper claims that 19 metamorphic relations developed for the correlation detection and LSTM forecasting components can effectively test the deep learning based forecaster. When executed on the actual application, they uncovered 8 unknown issues. When applied to a reference implementation with hypothetical bugs generated via mutation testing, the relations caught 65.9 percent of those bugs.
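
The kill-rate figure can be read as the fraction of mutants for which at least one relation is violated. The sketch below illustrates that bookkeeping on a toy forecaster; the forecaster, the three hand-written mutants, and the two relations (scaling and shifting the input) are hypothetical stand-ins chosen for illustration, not the paper's LSTM, mutants, or relations.

```python
import math

def forecast(series):
    """Toy reference: predict the next value as the mean of the last three points."""
    window = series[-3:]
    return sum(window) / len(window)

# Hand-written mutants (hypothetical; a mutation tool would generate these).
def mutant_wrong_window(series):
    window = series[-2:]                      # window shortened by one
    return sum(window) / len(window)

def mutant_wrong_divisor(series):
    window = series[-3:]
    return sum(window) / (len(window) + 1)    # divisor off by one

def mutant_added_offset(series):
    window = series[-3:]
    return sum(window) / len(window) + 1      # spurious constant added

def relations_hold(fn, series, k=2.0, c=5.0):
    """Check two relations: scaling the input by k scales the output by k,
    and adding c to every input adds c to the output."""
    base = fn(series)
    scaled_ok = math.isclose(fn([k * v for v in series]), k * base)
    shifted_ok = math.isclose(fn([v + c for v in series]), base + c)
    return scaled_ok and shifted_ok

series = [1.0, 2.0, 4.0, 8.0]
mutants = [mutant_wrong_window, mutant_wrong_divisor, mutant_added_offset]

# The unmutated code must satisfy the relations, otherwise they are unsound.
reference_ok = relations_hold(forecast, series)

# A mutant is "killed" when at least one relation is violated.
killed = [m.__name__ for m in mutants if not relations_hold(m, series)]
kill_rate = len(killed) / len(mutants)
print(killed, kill_rate)
```

In this toy setup the shortened-window mutant survives because it is still a mean, so both relations keep holding. A kill rate below 100 percent can arise in the same way: a surviving mutant changes behavior only in ways the chosen relations do not constrain.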

What carries the argument

Metamorphic Relations that specify expected output changes under defined input transformations, applied separately to correlation detection and to the LSTM predictor.
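
As a minimal sketch of what such a relation can look like for the correlation component, the example below uses a well-known property of the Pearson coefficient: it is unchanged when one variable is multiplied by a positive constant and shifted, and it flips sign when the multiplier is negative. Whether this exact relation is among the paper's 19 is not stated here; the function names and the synthetic `memory` and `cpu` series are assumptions made for illustration.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length 1-D arrays."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def check_affine_relation(corr_fn, x, y, scale=3.0, shift=10.0, tol=1e-9):
    """Metamorphic check with no expected value needed:
    corr(scale*x + shift, y) == corr(x, y) and
    corr(-scale*x + shift, y) == -corr(x, y) for scale > 0."""
    source = corr_fn(x, y)
    follow_up_pos = corr_fn(scale * x + shift, y)
    follow_up_neg = corr_fn(-scale * x + shift, y)
    return abs(follow_up_pos - source) <= tol and abs(follow_up_neg + source) <= tol

rng = np.random.default_rng(0)
memory = rng.normal(size=200)                # hypothetical system characteristic
cpu = 0.5 * memory + rng.normal(size=200)    # hypothetical correlated characteristic

relation_holds = check_affine_relation(pearson, memory, cpu)
print(relation_holds)
```

The check never needs to know the true correlation between the two series; it only compares the output on the source input with the output on the transformed input, which is what lets the test run without an oracle.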

If this is right

  • The same set of relations can be reused to test future versions or retrained instances of the outage forecaster.
  • The two evaluation settings together demonstrate that metamorphic relations can both find real faults in production code and measure coverage against artificial faults.
  • Proofs and algorithms supplied for some relations make the method repeatable for similar statistical and LSTM modules.
  • The approach separates testing of the correlation step from testing of the LSTM step, allowing independent verification of each.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be extended to other deep-learning time-series models by deriving analogous relations for their specific input-output behaviors.
  • Relations developed here might serve as a seed set when applying metamorphic testing to related forecasting tasks such as resource demand prediction.
  • Periodic re-execution of the relations after model retraining could detect degradation in the forecaster's statistical properties.
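
To make the idea of deriving analogous relations concrete, the sketch below checks a shift relation on a simple stand-in model: a least-squares AR(1) forecaster with an intercept. For this linear model, adding a constant to every training point shifts the one-step forecast by the same constant exactly. This is an assumption about the stand-in model only; it is not a relation taken from the paper, and a trained LSTM may satisfy such a relation only approximately or only after suitable input normalization.

```python
import numpy as np

def fit_ar1(series):
    """Least-squares fit of y[t] = a + b * y[t-1]; returns (a, b)."""
    y_prev, y_next = series[:-1], series[1:]
    design = np.column_stack([np.ones_like(y_prev), y_prev])
    (a, b), *_ = np.linalg.lstsq(design, y_next, rcond=None)
    return a, b

def forecast_next(series):
    """One-step-ahead forecast from the fitted AR(1) model."""
    a, b = fit_ar1(series)
    return float(a + b * series[-1])

rng = np.random.default_rng(1)
series = np.cumsum(rng.normal(size=100))     # synthetic random-walk data
shift = 50.0

source = forecast_next(series)               # forecast on the original data
follow_up = forecast_next(series + shift)    # forecast on the shifted data

# Relation under test: forecast(series + shift) == forecast(series) + shift
shift_relation_holds = bool(np.isclose(follow_up, source + shift))
print(shift_relation_holds)
```

For a model where the relation holds only approximately, the exact comparison would be replaced by a tolerance chosen from the model's observed run-to-run variation.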

Load-bearing premise

The 19 metamorphic relations accurately capture the intended properties of the correlation detection and LSTM components, so that a violation reliably indicates a fault rather than a false positive, and together they cover enough behavior that faults rarely go undetected.

What would settle it

A concrete case where one of the 19 relations is violated by the correct output of the forecaster, or where a known fault in the correlation or LSTM component evades detection by all relations.

Figures

Figures reproduced from arXiv: 1907.06632 by Anurag Dwarakanath, Arijit Naskar, Koushik MV, Manish Ahuja, Sanjay Podder, Silja Vinu.

Figure 1: Introduce a feature with a constant value so that the …
Figure 2: A single large outlier has changed the results from a …
Figure 3: Sample training and validation data used in the refer…
Figure 4: An example of an input data sequence and correspond…
Figure 5: Adding a constant to training and validation data. We …
Figure 6: A plot of the number of TIME STEPS versus the loss when the frequencies with low information in the data are removed. We choose a TIME STEPS around the region where the curve exhibits drastic changes in slope (around 25).

Algorithm 1: Algorithm to choose the TIME STEPS needed.
Input: Set of Input Validation Data X
Output: Information loss versus different values of TIME STEPS
1 Assign Loss = 0
2 for timeSt…

Original abstract

In this paper, we present the Metamorphic Testing of an in-use deep learning based forecasting application. The application looks at the past data of system characteristics (e.g. 'memory allocation') to predict outages in the future. We focus on two statistical / machine learning based components - a) detection of correlation between system characteristics and b) estimating the future value of a system characteristic using an LSTM (a deep learning architecture). In total, 19 Metamorphic Relations have been developed and we provide proofs & algorithms where applicable. We evaluated our method through two settings. In the first, we executed the relations on the actual application and uncovered 8 issues not known before. Second, we generated hypothetical bugs, through Mutation Testing, on a reference implementation of the LSTM based forecaster and found that 65.9% of the bugs were caught through the relations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript applies metamorphic testing to an in-use deep learning-based forecasting application that predicts system outages from historical characteristics. It develops 19 metamorphic relations targeting correlation detection between system metrics and LSTM-based forecasting, supplies proofs and algorithms for the relations where applicable, and evaluates the approach in two ways: direct execution on the live application (uncovering 8 previously unknown issues) and mutation testing on a reference LSTM implementation (65.9% kill rate).

Significance. If the relations are sound, the work supplies a concrete, reproducible case study of metamorphic testing applied to production ML components, with direct evidence of fault detection in both real and synthetic settings. The explicit provision of proofs for the relations is a notable strength that directly addresses the validity of the testing oracle.

major comments (1)
  1. [Abstract] The reported 65.9% mutation kill rate is given without the total number of mutants generated, the mutation operators employed, or a breakdown by operator or component. This information is required to determine whether the kill rate reflects adequate coverage of the LSTM forecaster or is sensitive to the particular choice of mutants.
minor comments (1)
  1. A summary table listing all 19 metamorphic relations, their target component (correlation detection or LSTM), and whether a proof or algorithm is supplied would improve readability and allow readers to assess coverage at a glance.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and the recommendation for minor revision. We address the single major comment below.

  1. Referee: [Abstract] The reported 65.9% mutation kill rate is given without the total number of mutants generated, the mutation operators employed, or a breakdown by operator or component. This information is required to determine whether the kill rate reflects adequate coverage of the LSTM forecaster or is sensitive to the particular choice of mutants.

    Authors: We agree that the abstract would benefit from additional context on the mutation testing experiment. In the revised manuscript we will update the abstract to report the total number of mutants generated, the mutation operators applied, and a concise breakdown by operator and by component (correlation detection vs. LSTM forecasting). These details already exist in the body of the paper and can be summarized within the abstract's length constraints.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected


This is an empirical case study paper that develops 19 metamorphic relations (with proofs and algorithms supplied where applicable) for correlation detection and LSTM forecasting components, then evaluates them via direct execution on a live application (finding 8 previously unknown issues) and via mutation testing on a reference implementation (65.9% mutation kill rate). No derivations, equations, fitted parameters, or predictions appear that reduce reported outcomes to inputs by construction; the soundness of the relations is addressed by the supplied proofs rather than by self-citation chains or definitional loops. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical software-testing study; it introduces no free parameters, mathematical axioms, or invented entities. The metamorphic relations rest on domain assumptions about expected system behavior rather than new postulates.


