A Systematic Mapping Study on Testing of Machine Learning Programs

Muhammad Uzair Khan; Muhammad Zohaib Iqbal; Salman Sherin

arxiv: 1907.09427 · v1 · pith:AGFD3DINnew · submitted 2019-07-11 · 💻 cs.LG

Pith reviewed 2026-05-24 22:58 UTC · model grok-4.3

keywords machine learning · software testing · systematic mapping study · empirical evidence · reinforcement learning · non-functional testing · test approaches

The pith

A systematic mapping of testing machine learning programs shows rapid growth but insufficient empirical evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper performs a systematic mapping study to identify, analyze, and classify existing research on testing machine learning programs. The authors reviewed 1654 papers and selected 37 that met their criteria up to January 2019. They classify the work by contribution type, research type, testing approach, and machine learning category. The study finds the field expanding quickly but short on empirical comparisons of technique effectiveness and on publicly available tools. It highlights the need for more work on non-functional testing and on programs using reinforcement learning.

Core claim

By following established systematic mapping guidelines, the study selected 37 papers and analyzed trends in contribution facets, research facets, test approaches, types of machine learning, and kinds of testing. The mapping reveals rapid growth in the area alongside a lack of sufficient empirical evidence for comparing techniques, a shortage of publicly available tools, and insufficient attention to non-functional testing and reinforcement learning.

What carries the argument

The systematic mapping protocol, including its research questions, inclusion/exclusion criteria, and classification scheme for themes in testing ML programs.

If this is right

The area of testing ML programs is growing rapidly.
There is insufficient empirical evidence to compare and assess the effectiveness of techniques.
More publicly available tools are required for practitioners and researchers.
Further attention is needed on non-functional testing and testing of ML programs using reinforcement learning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Researchers could use this overview to prioritize studies that provide comparative empirical evaluations of existing testing methods.
Tool developers might focus on creating open-source implementations of the most promising techniques identified in the mapping.
The identified gaps suggest that integration of testing practices into ML development pipelines remains limited.
Future mappings could track how these trends evolve beyond 2019.

Load-bearing premise

The search strategy, inclusion and exclusion criteria, and classification scheme applied up to January 2019 produce a representative sample of the literature.

What would settle it

A follow-up search using the same protocol after 2019 that identifies many additional papers showing strong empirical comparisons or widely used public tools would contradict the reported gaps.

Original abstract

We aim to conduct a systematic mapping in the area of testing ML programs. We identify, analyze and classify the existing literature to provide an overview of the area. We followed well-established guidelines of systematic mapping to develop a systematic protocol to identify and review the existing literature. We formulate three sets of research questions, define inclusion and exclusion criteria and systematically identify themes for the classification of existing techniques. We also report the quality of the published works using established assessment criteria. we finally selected 37 papers out of 1654 based on our selection criteria up to January 2019. We analyze trends such as contribution facet, research facet, test approach, type of ML and the kind of testing with several other attributes. We also discuss the empirical evidence and reporting quality of selected papers. The data from the study is made publicly available for other researchers and practitioners. We present an overview of the area by answering several research questions. The area is growing rapidly, however, there is lack of enough empirical evidence to compare and assess the effectiveness of the techniques. More publicly available tools are required for use of practitioners and researchers. Further attention is needed on non-functional testing and testing of ML programs using reinforcement learning. We believe that this study can help researchers and practitioners to obtain an overview of the area and identify several sub-areas where more research is required

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. This paper conducts a systematic mapping study on testing of machine learning programs. Following established guidelines, the authors define research questions, search multiple databases up to January 2019, apply inclusion/exclusion criteria to select 37 papers from 1654 candidates, classify them by contribution facet, research facet, test approach, ML type and testing kind, assess reporting quality, and release the data publicly. They report trends in the literature and conclude that the area is growing rapidly but lacks sufficient empirical evidence for technique comparison, requires more public tools, and needs further work on non-functional testing and reinforcement learning.

Significance. If the sample is representative, the study offers a useful snapshot of an emerging subfield, surfaces actionable gaps (empirical comparisons, tool availability, non-functional/RL coverage), and provides a public dataset that can seed follow-on reviews or targeted primary studies.

major comments (2)

[Abstract] Abstract and selection description: the central claims (rapid growth, insufficient empirical evidence, gaps in non-functional and RL testing) rest entirely on counts and trends across the final 37 papers; the manuscript reports no inter-rater agreement statistic for the classification scheme nor any sensitivity analysis on search strings, databases, or date cutoff, leaving open the possibility that alternate terminology or modest protocol changes would alter the reported gaps.
[Quality assessment] Quality assessment paragraph: the claim that the 37 papers exhibit limited empirical evidence is load-bearing for the main recommendation, yet the text provides no breakdown of how the established quality criteria were scored per paper or any aggregate reliability check, making it impossible to judge whether the evidence-quality conclusion is robust.

minor comments (2)

[Abstract] The abstract contains a sentence-initial capitalization error ('we finally selected'), and its final sentence lacks a closing period.
The date cutoff (January 2019) is stated but the manuscript does not discuss how the rapidly evolving terminology in ML testing might affect completeness even within that window.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on transparency and robustness in our systematic mapping study. We address each major comment below and will revise the manuscript accordingly to improve reporting.

point-by-point responses

Referee: [Abstract] Abstract and selection description: the central claims (rapid growth, insufficient empirical evidence, gaps in non-functional and RL testing) rest entirely on counts and trends across the final 37 papers; the manuscript reports no inter-rater agreement statistic for the classification scheme nor any sensitivity analysis on search strings, databases, or date cutoff, leaving open the possibility that alternate terminology or modest protocol changes would alter the reported gaps.

Authors: We agree that reporting inter-rater agreement would strengthen the classification reliability. The process involved multiple authors with discussion-based resolution of disagreements, but no formal statistic (e.g., Cohen's kappa) was computed or included. We will add this in the revised manuscript using available records. For sensitivity analysis, our protocol followed established guidelines with pilot searches; we will add a limitations subsection discussing the search rationale and potential impacts of variations without claiming the gaps are invariant. revision: yes

Referee: [Quality assessment] Quality assessment paragraph: the claim that the 37 papers exhibit limited empirical evidence is load-bearing for the main recommendation, yet the text provides no breakdown of how the established quality criteria were scored per paper or any aggregate reliability check, making it impossible to judge whether the evidence-quality conclusion is robust.

Authors: We recognize that a per-paper breakdown of quality scores would allow better evaluation of the empirical evidence conclusion. The criteria were applied individually to support the aggregate assessment. We will include a supplementary table with scores for each of the 37 papers and any reliability checks in the revision. revision: yes
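
The first response above mentions Cohen's kappa as a candidate inter-rater agreement statistic. As a minimal sketch of what such a figure measures, the snippet below computes kappa for two raters who label the same items. The function name `cohens_kappa` and the label lists are hypothetical illustrations, not data or code from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical classifications of six papers (NOT taken from the study).
rater_1 = ["functional", "functional", "non-functional",
           "functional", "non-functional", "functional"]
rater_2 = ["functional", "non-functional", "non-functional",
           "functional", "non-functional", "functional"]

kappa = cohens_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.3f}")
```

Kappa discounts the agreement two raters would reach by chance, so it is a stricter measure than the raw fraction of matching labels.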

Circularity Check

0 steps flagged

No circularity: descriptive literature mapping with no derivations or self-referential reductions

full rationale

The paper is a systematic mapping study that searches databases up to January 2019, applies inclusion/exclusion criteria, selects 37 papers from 1654, and classifies them by facets such as contribution, research type, test approach, ML type, and testing kind. All reported trends, gaps, and recommendations are direct aggregates and summaries of those classifications. There are no equations, fitted parameters, predictions, first-principles derivations, or load-bearing self-citations that reduce the central claims to the inputs by construction. The methodology description does not create a self-definitional loop; the output is the classification itself. This matches the default expectation of no circularity for non-derivational descriptive work.
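
The select-then-classify-then-aggregate flow described above can be sketched in a few lines. Everything in the snippet is a hypothetical illustration: the record fields, the `include` criterion, and the four example records are assumptions made for the sketch, not the paper's protocol or data.

```python
from collections import Counter

# Hypothetical candidate records (NOT the study's data).
candidates = [
    {"title": "Paper A", "year": 2017, "about_ml_testing": True,
     "ml_type": "supervised", "testing_kind": "functional"},
    {"title": "Paper B", "year": 2018, "about_ml_testing": True,
     "ml_type": "supervised", "testing_kind": "functional"},
    {"title": "Paper C", "year": 2018, "about_ml_testing": False,
     "ml_type": None, "testing_kind": None},
    {"title": "Paper D", "year": 2019, "about_ml_testing": True,
     "ml_type": "reinforcement", "testing_kind": "non-functional"},
]


def include(paper):
    """Hypothetical inclusion criterion: the paper is about testing ML programs."""
    return paper["about_ml_testing"]


def tally(papers, facet):
    """Count the selected papers per value of one classification facet."""
    return Counter(paper[facet] for paper in papers)


# Step 1: apply the inclusion criterion to the candidates.
selected = [paper for paper in candidates if include(paper)]

# Step 2: aggregate the classifications, one facet at a time.
by_ml_type = tally(selected, "ml_type")
by_testing_kind = tally(selected, "testing_kind")
by_year = tally(selected, "year")

print(len(selected), dict(by_ml_type), dict(by_testing_kind), dict(by_year))
```

Because each reported trend is a count over such records, the result depends directly on which papers pass the criteria and how each one is labelled.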

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that systematic mapping guidelines are suitable and that the search up to January 2019 is sufficiently complete; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Well-established guidelines of systematic mapping are appropriate for identifying and classifying literature on testing ML programs.
The paper states it followed these guidelines to develop the protocol.

pith-pipeline@v0.9.0 · 5771 in / 1141 out tokens · 19364 ms · 2026-05-24T22:58:20.526588+00:00 · methodology
