A Systematic Mapping Study on Testing of Machine Learning Programs
Pith reviewed 2026-05-24 22:58 UTC · model grok-4.3
The pith
A systematic mapping of testing machine learning programs shows rapid growth but insufficient empirical evidence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By following established systematic mapping guidelines, the study selected 37 papers and analyzed trends in contribution facets, research facets, test approaches, types of machine learning, and kinds of testing. The mapping reveals rapid growth in the area alongside a lack of sufficient empirical evidence for comparing techniques, a shortage of publicly available tools, and insufficient attention to non-functional testing and reinforcement learning.
What carries the argument
The systematic mapping protocol: its research questions, inclusion/exclusion criteria, and classification scheme for themes in testing ML programs.
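To make the mechanics of such a protocol concrete, here is a minimal Python sketch of the two steps it describes: filtering candidates against inclusion/exclusion criteria, then tallying one classification facet. The record fields, the sample entries, and the criteria themselves are hypothetical placeholders and are not the study's actual data or criteria; only the 37-of-1654 figure is taken from the abstract.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One search hit. Every field here is a hypothetical stand-in."""
    title: str
    year: int
    about_ml_testing: bool  # assumed inclusion criterion
    peer_reviewed: bool     # assumed inclusion criterion
    ml_type: str            # assumed classification facet


def is_included(c: Candidate, cutoff_year: int = 2019) -> bool:
    """Apply the assumed criteria; year granularity is a simplification."""
    return c.about_ml_testing and c.peer_reviewed and c.year <= cutoff_year


def tally_facet(selected: list[Candidate]) -> Counter:
    """Count selected papers per value of the ml_type facet."""
    return Counter(c.ml_type for c in selected)


# Made-up candidates, for illustration only.
candidates = [
    Candidate("Paper A", 2017, True, True, "supervised"),
    Candidate("Paper B", 2018, True, True, "supervised"),
    Candidate("Paper C", 2018, True, False, "unsupervised"),   # not peer reviewed
    Candidate("Paper D", 2020, True, True, "reinforcement"),   # after the cutoff
    Candidate("Paper E", 2016, False, True, "supervised"),     # off topic
    Candidate("Paper F", 2018, True, True, "reinforcement"),
]

selected = [c for c in candidates if is_included(c)]
facet_counts = tally_facet(selected)

# Selection rate implied by the abstract: 37 papers kept out of 1654.
selection_rate = 37 / 1654

print(facet_counts)
print(f"selection rate: {selection_rate:.1%}")
```

Because every reported trend is an aggregate of counts like `facet_counts`, small changes to `is_included` can shift the tallies, which is why the criteria carry so much of the argument.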
If this is right
- The area of testing ML programs is growing rapidly.
- There is not enough empirical evidence to compare and assess the effectiveness of techniques.
- More publicly available tools are required for practitioners and researchers.
- Further attention is needed on non-functional testing and testing of ML programs using reinforcement learning.
Where Pith is reading between the lines
- Researchers could use this overview to prioritize studies that provide comparative empirical evaluations of existing testing methods.
- Tool developers might focus on creating open-source implementations of the most promising techniques identified in the mapping.
- The identified gaps suggest that integration of testing practices into ML development pipelines remains limited.
- Future mappings could track how these trends evolve beyond 2019.
Load-bearing premise
The search strategy, inclusion and exclusion criteria, and classification scheme applied up to January 2019 produce a representative sample of the literature.
What would settle it
A follow-up search using the same protocol after 2019 that identifies many additional papers showing strong empirical comparisons or widely used public tools would contradict the reported gaps.
Original abstract
We aim to conduct a systematic mapping in the area of testing ML programs. We identify, analyze and classify the existing literature to provide an overview of the area. We followed well-established guidelines of systematic mapping to develop a systematic protocol to identify and review the existing literature. We formulate three sets of research questions, define inclusion and exclusion criteria and systematically identify themes for the classification of existing techniques. We also report the quality of the published works using established assessment criteria. we finally selected 37 papers out of 1654 based on our selection criteria up to January 2019. We analyze trends such as contribution facet, research facet, test approach, type of ML and the kind of testing with several other attributes. We also discuss the empirical evidence and reporting quality of selected papers. The data from the study is made publicly available for other researchers and practitioners. We present an overview of the area by answering several research questions. The area is growing rapidly, however, there is lack of enough empirical evidence to compare and assess the effectiveness of the techniques. More publicly available tools are required for use of practitioners and researchers. Further attention is needed on non-functional testing and testing of ML programs using reinforcement learning. We believe that this study can help researchers and practitioners to obtain an overview of the area and identify several sub-areas where more research is required
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper conducts a systematic mapping study on testing of machine learning programs. Following established guidelines, the authors define research questions, search multiple databases up to January 2019, apply inclusion/exclusion criteria to select 37 papers from 1654 candidates, classify them by contribution facet, research facet, test approach, ML type and testing kind, assess reporting quality, and release the data publicly. They report trends in the literature and conclude that the area is growing rapidly but lacks sufficient empirical evidence for technique comparison, requires more public tools, and needs further work on non-functional testing and reinforcement learning.
Significance. If the sample is representative, the study offers a useful snapshot of an emerging subfield, surfaces actionable gaps (empirical comparisons, tool availability, non-functional/RL coverage), and provides a public dataset that can seed follow-on reviews or targeted primary studies.
major comments (2)
- [Abstract] Abstract and selection description: the central claims (rapid growth, insufficient empirical evidence, gaps in non-functional and RL testing) rest entirely on counts and trends across the final 37 papers; the manuscript reports no inter-rater agreement statistic for the classification scheme nor any sensitivity analysis on search strings, databases, or date cutoff, leaving open the possibility that alternate terminology or modest protocol changes would alter the reported gaps.
- [Quality assessment] Quality assessment paragraph: the claim that the 37 papers exhibit limited empirical evidence is load-bearing for the main recommendation, yet the text provides no breakdown of how the established quality criteria were scored per paper or any aggregate reliability check, making it impossible to judge whether the evidence-quality conclusion is robust.
minor comments (2)
- [Abstract] Abstract contains a sentence-initial capitalization error ('we finally selected').
- The date cutoff (January 2019) is stated but the manuscript does not discuss how the rapidly evolving terminology in ML testing might affect completeness even within that window.
Simulated Authors' Rebuttal
We thank the referee for the constructive feedback on transparency and robustness in our systematic mapping study. We address each major comment below and will revise the manuscript accordingly to improve reporting.
point-by-point responses
- Referee: [Abstract] Abstract and selection description: the central claims (rapid growth, insufficient empirical evidence, gaps in non-functional and RL testing) rest entirely on counts and trends across the final 37 papers; the manuscript reports no inter-rater agreement statistic for the classification scheme nor any sensitivity analysis on search strings, databases, or date cutoff, leaving open the possibility that alternate terminology or modest protocol changes would alter the reported gaps.
  Authors: We agree that reporting inter-rater agreement would strengthen the classification reliability. The process involved multiple authors with discussion-based resolution of disagreements, but no formal statistic (e.g., Cohen's kappa) was computed or included. We will add this in the revised manuscript using available records. For sensitivity analysis, our protocol followed established guidelines with pilot searches; we will add a limitations subsection discussing the search rationale and potential impacts of variations without claiming the gaps are invariant.
  revision: yes
- Referee: [Quality assessment] Quality assessment paragraph: the claim that the 37 papers exhibit limited empirical evidence is load-bearing for the main recommendation, yet the text provides no breakdown of how the established quality criteria were scored per paper or any aggregate reliability check, making it impossible to judge whether the evidence-quality conclusion is robust.
  Authors: We recognize that a per-paper breakdown of quality scores would allow better evaluation of the empirical evidence conclusion. The criteria were applied individually to support the aggregate assessment. We will include a supplementary table with scores for each of the 37 papers and any reliability checks in the revision.
  revision: yes
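For readers unfamiliar with the inter-rater agreement statistic named in the first response, the following is a minimal sketch of Cohen's kappa for two raters who each assign one label per paper. The labels and ratings below are invented for illustration and do not come from the study; the handling of the degenerate case is a choice made for this sketch.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning one categorical label per item."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is formally
        # undefined (0/0). Returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings of eight papers on a made-up "test approach" facet.
rater_a = ["white-box", "black-box", "black-box", "white-box",
           "black-box", "white-box", "black-box", "black-box"]
rater_b = ["white-box", "black-box", "white-box", "white-box",
           "black-box", "black-box", "black-box", "black-box"]

kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")
```

In this invented example the raters agree on 6 of 8 papers (0.75), but chance agreement is 34/64, so kappa is noticeably lower than the raw agreement rate. That gap is the reason a formal statistic is more informative than reporting that disagreements were resolved by discussion.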
Circularity Check
No circularity: descriptive literature mapping with no derivations or self-referential reductions
full rationale
The paper is a systematic mapping study that searches databases up to January 2019, applies inclusion/exclusion criteria, selects 37 papers from 1654, and classifies them by facets such as contribution, research type, test approach, ML type, and testing kind. All reported trends, gaps, and recommendations are direct aggregates and summaries of those classifications. There are no equations, fitted parameters, predictions, first-principles derivations, or load-bearing self-citations that reduce the central claims to the inputs by construction. The methodology description does not create a self-definitional loop; the output is the classification itself. This matches the default expectation of no circularity for non-derivational descriptive work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Well-established guidelines of systematic mapping are appropriate for identifying and classifying literature on testing ML programs.