Metamorphic Testing of a Deep Learning based Forecaster

Anurag Dwarakanath; Arijit Naskar; Koushik MV; Manish Ahuja; Sanjay Podder; Silja Vinu

arxiv: 1907.06632 · v1 · pith:7R5V4KE4new · submitted 2019-07-13 · 💻 cs.LG · cs.SE

Pith reviewed 2026-05-24 21:53 UTC · model grok-4.3

classification 💻 cs.LG cs.SE

keywords metamorphic testing · deep learning · LSTM · forecasting · software testing · mutation testing · correlation detection · outage prediction

The pith

Metamorphic testing with 19 relations uncovers 8 unknown issues in a live deep learning outage forecaster and catches 65.9 percent of mutated bugs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper applies metamorphic testing to a deep learning forecasting application that predicts future system outages from past characteristics such as memory allocation. It defines 19 metamorphic relations for two components: statistical detection of correlations between system variables, and LSTM-based prediction of future values. The relations allow behavior to be verified under input transformations without needing an exact expected output for each test case. When run on the production application, the relations surfaced 8 previously undetected issues. Mutation testing on a reference LSTM implementation showed that the relations detected 65.9 percent of the injected faults.

Core claim

The paper claims that 19 metamorphic relations developed for the correlation detection and LSTM forecasting components can effectively test the deep learning based forecaster. When executed on the actual application, they uncovered 8 unknown issues. When applied to a reference implementation with hypothetical bugs generated via mutation testing, the relations caught 65.9 percent of those bugs.

What carries the argument

Metamorphic Relations that specify expected output changes under defined input transformations, applied separately to correlation detection and to the LSTM predictor.
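
As a minimal sketch of what such a relation looks like in code, the Python below checks one well-known property of the Pearson coefficient: applying a positive scale and a shift to one variable leaves the coefficient unchanged, and a negative scale flips its sign. This relation is chosen for illustration and is not claimed to be one of the paper's 19; the function names, the sample values, and the deliberately faulty `buggy_pearson` are all hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def buggy_pearson(xs, ys):
    """Hypothetical faulty variant: forgets to centre xs in the variance term."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum(x ** 2 for x in xs)  # fault: should use (x - mean_x) ** 2
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def affine_relation_holds(corr_fn, xs, ys, scale, shift, tol=1e-9):
    """Check corr(scale * x + shift, y) == sign(scale) * corr(x, y), scale != 0."""
    source = corr_fn(xs, ys)                                   # source test case
    follow_up = corr_fn([scale * x + shift for x in xs], ys)   # follow-up test case
    expected = math.copysign(1.0, scale) * source
    return abs(follow_up - expected) <= tol


# Illustrative values only; no exact expected correlation is needed.
memory = [512.0, 530.0, 610.0, 580.0, 700.0, 690.0]
cpu = [20.0, 22.0, 31.0, 27.0, 40.0, 38.0]

correct_ok = affine_relation_holds(pearson, memory, cpu, scale=3.0, shift=100.0)
faulty_ok = affine_relation_holds(buggy_pearson, memory, cpu, scale=3.0, shift=100.0)
```

The check never compares against a known-correct coefficient. It only compares the source output with the follow-up output, which is what lets a relation stand in for a test oracle: the correct implementation satisfies it, while the faulty variant does not.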

If this is right

The same set of relations can be reused to test future versions or retrained instances of the outage forecaster.
The two evaluation settings together demonstrate that metamorphic relations can both find real faults in production code and measure coverage against artificial faults.
Proofs and algorithms supplied for some relations make the method repeatable for similar statistical and LSTM modules.
The approach separates testing of the correlation step from testing of the LSTM step, allowing independent verification of each.
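
The idea of measuring relations against artificial faults can be illustrated with a toy mutation experiment. The sketch below is a stand-in under stated assumptions, not the paper's setup: the forecaster is a moving average rather than an LSTM, the three mutants are written by hand rather than generated by a mutation tool, and the two relations (shift and scale equivariance) are illustrative. It only shows how a kill rate is computed: a mutant counts as killed when at least one relation is violated.

```python
def reference_forecast(series, window=3):
    """Toy stand-in forecaster: mean of the last `window` points."""
    tail = series[-window:]
    return sum(tail) / len(tail)


# Hand-written hypothetical mutants, each with one small deliberate fault.
def mutant_wrong_divisor(series, window=3):
    tail = series[-window:]
    return sum(tail) / (len(tail) + 1)   # off-by-one in the divisor


def mutant_added_offset(series, window=3):
    tail = series[-window:]
    return sum(tail) / len(tail) + 1.0   # spurious constant added


def mutant_wrong_window(series, window=3):
    tail = series[-(window + 1):]        # reads one extra point
    return sum(tail) / len(tail)


def shift_relation(forecast, series, shift=10.0, tol=1e-9):
    """forecast(x + c) should equal forecast(x) + c."""
    shifted = [v + shift for v in series]
    return abs(forecast(shifted) - (forecast(series) + shift)) <= tol


def scale_relation(forecast, series, factor=2.0, tol=1e-9):
    """forecast(a * x) should equal a * forecast(x)."""
    scaled = [v * factor for v in series]
    return abs(forecast(scaled) - factor * forecast(series)) <= tol


def kill_rate(mutants, relations, series):
    """Return (fraction of mutants violating at least one relation, their names)."""
    killed = [m.__name__ for m in mutants
              if not all(rel(m, series) for rel in relations)]
    return len(killed) / len(mutants), killed


series = [4.0, 8.0, 6.0, 10.0, 12.0]   # illustrative data
relations = [shift_relation, scale_relation]
mutants = [mutant_wrong_divisor, mutant_added_offset, mutant_wrong_window]

rate, killed = kill_rate(mutants, relations, series)
```

In this toy run two of the three mutants violate a relation, while `mutant_wrong_window` satisfies both and survives. A kill rate below 100 percent therefore says that some faults escape the chosen relations, and the result depends on which mutants are generated.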

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be extended to other deep-learning time-series models by deriving analogous relations for their specific input-output behaviors.
Relations developed here might serve as a seed set when applying metamorphic testing to related forecasting tasks such as resource demand prediction.
Periodic re-execution of the relations after model retraining could detect degradation in the forecaster's statistical properties.

Load-bearing premise

The 19 metamorphic relations accurately capture the intended properties of the correlation detection and LSTM components such that violations reliably indicate faults as opposed to false positives or incomplete coverage.

What would settle it

A concrete case where one of the 19 relations is violated by the correct output of the forecaster, or where a known fault in the correlation or LSTM component evades detection by all relations.

Figures

Figures reproduced from arXiv: 1907.06632 by Anurag Dwarakanath, Arijit Naskar, Koushik MV, Manish Ahuja, Sanjay Podder, Silja Vinu.

**Figure 2.** A single large outlier has changed the results from a…

**Figure 3.** Sample training and validation data used in the refer…

**Figure 4.** An example of an input data sequence and correspond…

**Figure 5.** Adding a constant to training and validation data. We…

**Figure 6.** A plot of the number of TIME STEPS versus the loss when the frequencies with low information in the data are removed. We choose a TIME STEPS around the region where the curve exhibits drastic changes in slope (around 25).

**Algorithm 1.** Algorithm to choose the TIME STEPS needed.
Input: Set of Input Validation Data X
Output: Information loss versus different values of TIME STEPS
1. Assign Loss = 0
2. for timeSt…

Original abstract

In this paper, we present the Metamorphic Testing of an in-use deep learning based forecasting application. The application looks at the past data of system characteristics (e.g. 'memory allocation') to predict outages in the future. We focus on two statistical / machine learning based components - a) detection of co-relation between system characteristics and b) estimating the future value of a system characteristic using an LSTM (a deep learning architecture). In total, 19 Metamorphic Relations have been developed and we provide proofs & algorithms where applicable. We evaluated our method through two settings. In the first, we executed the relations on the actual application and uncovered 8 issues not known before. Second, we generated hypothetical bugs, through Mutation Testing, on a reference implementation of the LSTM based forecaster and found that 65.9% of the bugs were caught through the relations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Metamorphic testing found eight real issues in a production LSTM outage forecaster and killed 65.9% of mutants on a reference version.

The main takeaway is that the authors applied metamorphic testing to a live deep-learning forecaster that predicts system outages from past metrics. They created nineteen relations aimed at the correlation-detection step and the LSTM prediction step, executed them on the actual application, and surfaced eight previously unknown issues. On a reference LSTM implementation they also ran mutation testing and caught 65.9 percent of the injected faults. Proofs and algorithms are supplied for the relations where they apply.

This is a concrete, dual-evaluation case study rather than a broad theoretical claim. The practical strength is the combination of real-system execution and quantitative mutation results; both are reported with specific numbers instead of vague assertions. The scope stays narrow (one forecasting pipeline), so the paper does not position the relations as a general solution for all time-series models.

Two points are worth checking in the full text: whether the relations were developed before seeing the mutation outcomes, and whether any of the eight flagged issues later proved to be false positives once the developers examined them. The 65.9 percent kill rate is useful but indicates the relations do not cover every possible fault. Readers who maintain or test production forecasting systems that combine statistical checks with LSTMs will get the most from the example. It is worth sending to peer review because the claims rest on direct runs against both live code and controlled mutations rather than unverified assertions.

Referee Report

1 major / 1 minor

Summary. The manuscript applies metamorphic testing to an in-use deep learning-based forecasting application that predicts system outages from historical characteristics. It develops 19 metamorphic relations targeting correlation detection between system metrics and LSTM-based forecasting, supplies proofs and algorithms for the relations where applicable, and evaluates the approach in two ways: direct execution on the live application (uncovering 8 previously unknown issues) and mutation testing on a reference LSTM implementation (65.9% kill rate).

Significance. If the relations are sound, the work supplies a concrete, reproducible case study of metamorphic testing applied to production ML components, with direct evidence of fault detection in both real and synthetic settings. The explicit provision of proofs for the relations is a notable strength that directly addresses the validity of the testing oracle.

major comments (1)

[Abstract] The reported 65.9% mutation kill rate is given without the total number of mutants generated, the mutation operators employed, or a breakdown by operator or component. This information is required to determine whether the kill rate reflects adequate coverage of the LSTM forecaster or is sensitive to the particular choice of mutants.

minor comments (1)

A summary table listing all 19 metamorphic relations, their target component (correlation detection or LSTM), and whether a proof or algorithm is supplied would improve readability and allow readers to assess coverage at a glance.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and the recommendation for minor revision. We address the single major comment below.

Referee: [Abstract] The reported 65.9% mutation kill rate is given without the total number of mutants generated, the mutation operators employed, or a breakdown by operator or component. This information is required to determine whether the kill rate reflects adequate coverage of the LSTM forecaster or is sensitive to the particular choice of mutants.

Authors: We agree that the abstract would benefit from additional context on the mutation testing experiment. In the revised manuscript we will update the abstract to report the total number of mutants generated, the mutation operators applied, and a concise breakdown by operator and by component (correlation detection vs. LSTM forecasting). These details already exist in the body of the paper and can be summarized within the abstract's length constraints.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

This is an empirical case study paper that develops 19 metamorphic relations (with proofs and algorithms supplied where applicable) for correlation detection and LSTM forecasting components, then evaluates them via direct execution on a live application (finding 8 previously unknown issues) and via mutation testing on a reference implementation (65.9% mutation kill rate). No derivations, equations, fitted parameters, or predictions appear that reduce reported outcomes to inputs by construction; the soundness of the relations is addressed by the supplied proofs rather than by self-citation chains or definitional loops. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical software-testing study; it introduces no free parameters, mathematical axioms, or invented entities. The metamorphic relations rest on domain assumptions about expected system behavior rather than new postulates.
