
arxiv: 2605.15232 · v1 · pith:7KAKV3AHnew · submitted 2026-05-13 · 💻 cs.SE

Method-level Change-proneness: A Better Metric for Black-box Test Suite Minimization

Pith reviewed 2026-05-19 17:13 UTC · model grok-4.3

classification 💻 cs.SE
keywords test suite minimization · change-proneness · black-box testing · method-level metrics · software testing · fault detection · call graph analysis

The pith

Method-level change-proneness provides a stronger guide than class-level metrics for shrinking black-box test suites.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tries to establish that change-proneness measured at the level of individual methods identifies more relevant test cases for minimization than the same metric applied at the class level. It works by pulling change data from version history, tracing which methods each test reaches through call graphs, and then ranking tests with statistical scores such as averages or geometric means. A sympathetic reader would care because large test suites are expensive to run yet many tests add little new fault-finding power, so a reliable black-box reduction method could lower costs while preserving detection of bugs. The approach is demonstrated on fifteen open-source Java projects that together contain 635 known buggy versions.

Core claim

The authors establish that computing change-proneness for each method from version-control metadata, linking test cases to methods via test-code call-graph analysis, and scoring the associations with statistical measures such as average and geometric mean produces reduced test suites that retain higher accuracy and fault-detection capability than either class-level change-proneness or similarity-based selection.

What carries the argument

The MCTM process, which ranks each test case by a statistical score computed over the change-proneness of the methods it reaches through call-graph dependencies.
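The ranking step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the change-proneness values, the test and method names, and the `budget` parameter are hypothetical, and only the two aggregates named in the abstract (average and geometric mean) are shown.

```python
from math import prod
from statistics import mean


def geometric_mean(values):
    """Geometric mean of non-negative numbers (0.0 if any value is 0)."""
    return prod(values) ** (1.0 / len(values))


def score_tests(test_to_methods, method_cp, aggregate=mean):
    """Score each test by aggregating the CP of the methods it reaches.

    test_to_methods: test name -> list of method names (assumed to come
                     from a call-graph step that is not shown here).
    method_cp:       method name -> change-proneness value (hypothetical).
    """
    scores = {}
    for test, methods in test_to_methods.items():
        cps = [method_cp[m] for m in methods if m in method_cp]
        # Tests that reach no scored method get the lowest possible score.
        scores[test] = aggregate(cps) if cps else 0.0
    return scores


def select_top(scores, budget):
    """Keep the `budget` highest-scoring tests; ties are broken by name."""
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:budget]


# Hypothetical data, for illustration only.
method_cp = {"Reader.read": 6, "Reader.close": 2, "Parser.parse": 1}
test_to_methods = {
    "t1": ["Reader.read", "Reader.close"],
    "t2": ["Parser.parse"],
    "t3": ["Reader.read", "Parser.parse"],
}
reduced_suite = select_top(score_tests(test_to_methods, method_cp), budget=2)
```

Swapping `aggregate=geometric_mean` into `score_tests` changes how strongly a single low-CP method pulls a test's score down, which is presumably why the paper compares several aggregates.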

If this is right

  • Black-box test-suite reduction becomes feasible at scale without inspecting production source code.
  • Test cases tied to change-prone methods are retained in preference to others.
  • Fault detection remains high while the number of executed tests decreases.
  • The method runs more efficiently than similarity-based alternatives on the evaluated projects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same call-graph linking step could be reused in other black-box reduction techniques that already collect execution traces.
  • Projects with frequent small commits might see even stronger gains because method-level change signals would be more precise than class-level ones.
  • Teams could combine the method scores with simple execution-time data to further trim suites without additional static analysis.

Load-bearing premise

The test-code call-graph accurately captures which methods each test case actually depends on or exercises.
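To make the premise concrete, the sketch below walks a call graph from a test method and records the production methods it reaches, following only test-side callees. It is an assumption-laden illustration: the graph is given as a plain dictionary (how it is extracted from Java test sources is not shown), and the method names and the `is_production` predicate are hypothetical.

```python
def methods_reached(call_graph, test, is_production):
    """Return the production methods reachable from `test`.

    call_graph:    caller name -> iterable of callee names.
    is_production: predicate telling production methods from test code.
    Traversal continues only through test-side helpers, so production
    method bodies are never inspected.
    """
    reached = set()
    seen = {test}
    stack = [test]
    while stack:
        caller = stack.pop()
        for callee in call_graph.get(caller, ()):
            if is_production(callee):
                reached.add(callee)  # record the signature, do not descend
            elif callee not in seen:
                seen.add(callee)
                stack.append(callee)
    return reached


# Hypothetical test-code call graph, for illustration only.
call_graph = {
    "CsvTest.testRead": ["CsvTest.makeReader", "ExtendedBufferedReader.read"],
    "CsvTest.makeReader": ["ExtendedBufferedReader.<init>", "CsvTest.makeReader"],
}
reached = methods_reached(
    call_graph,
    "CsvTest.testRead",
    is_production=lambda name: not name.startswith("CsvTest."),
)
```

Any call that the extraction step misses (reflection, dependency injection, dynamic dispatch) never enters `call_graph`, which is exactly where the premise could fail.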

What would settle it

Running MCTM on a fresh collection of projects with documented buggy versions and measuring whether the average fault-detection rate drops substantially below the 0.94 reported for the original fifteen projects.

Figures

Figures reproduced from arXiv: 2605.15232 by Kazi Sakib, Md Siam.

Figure 1: Approach Overview

TABLE I: Commit Statistics for ExtendedBufferedReader::read()
Method | Change Commits | Total Commits | Insertions | Deletions | Remarks
ExtendedBufferedReader::read(char[] buf, int off, int len) | 4 | 573 | 91 | 49 | Original
ExtendedBufferedReader::read(char[] buffer, int off, int len) | 2 | 21 | 32 | 33 | Renamed
ExtendedBufferedReader::read(char[], int, int) | 6 | 573 | 123 | 82 | Aggregated

prior change history se…
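The aggregated row in Table I is the sum of the two rows above it once parameter names are ignored. A minimal sketch of that bookkeeping follows; the normalization rule (drop the last token of each parameter) is an assumption made for illustration, not a description of the authors' tooling, and it would mishandle modifiers such as `final` or generic types containing commas.

```python
from collections import defaultdict


def normalize_signature(signature):
    """Drop parameter names and spacing so renamed parameters compare equal."""
    name, _, rest = signature.partition("(")
    params = rest.rsplit(")", 1)[0]
    types = []
    for param in params.split(","):
        tokens = param.split()
        if not tokens:
            continue
        # Assumption: with more than one token, the last one is the name.
        type_tokens = tokens[:-1] if len(tokens) > 1 else tokens
        types.append("".join(type_tokens))
    return f"{name.strip()}({', '.join(types)})"


def aggregate_history(records):
    """Sum change commits, insertions and deletions per normalized signature."""
    totals = defaultdict(lambda: {"change_commits": 0, "insertions": 0, "deletions": 0})
    for signature, commits, insertions, deletions in records:
        entry = totals[normalize_signature(signature)]
        entry["change_commits"] += commits
        entry["insertions"] += insertions
        entry["deletions"] += deletions
    return dict(totals)


# The "Original" and "Renamed" rows of Table I.
records = [
    ("ExtendedBufferedReader::read(char [ ] buf , int off , int len)", 4, 91, 49),
    ("ExtendedBufferedReader::read(char [ ] buffer , int off , int len)", 2, 32, 33),
]
history = aggregate_history(records)
```

With the two rows from the table, `history` holds a single entry for `ExtendedBufferedReader::read(char[], int, int)` with 6 change commits, 123 insertions and 82 deletions, matching the "Aggregated" row.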
Figure 2: Test case - Method Dependency Mapping Example
Original abstract

Test Suite Minimization (TSM) reduces the size of test suites while preserving their fault detection capability. In black-box TSM, reduction is performed without analyzing production code. While several black-box TSM approaches have explored metrics like test logs or test similarity, these often suffer from scalability and efficiency issues. On the other hand, change-proneness (CP), which recently emerged as an efficient and scalable alternative metric, has only been applied at the class level. To accurately identify fault-revealing test cases, we propose CP at the finer-grained method level and implement Method-level Change-proneness based Test-suite Minimization (MCTM). MCTM first calculates CP for each method from version control metadata, then determines the dependency between test cases and methods by analyzing the test-code call-graph. Next, it scores the association between test cases and their invoked methods using statistical measures such as Average and Geometric Mean. Finally, test cases with the highest scores are selected to form the reduced suite. Evaluation on 15 open-source Java projects with 635 buggy versions shows MCTM achieves 0.93 accuracy and 0.94 fault detection rate on average, significantly outperforming class-level CP and similarity-based approaches while maintaining superior efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Method-level Change-proneness based Test-suite Minimization (MCTM) as an improvement over class-level change-proneness for black-box test suite minimization. MCTM computes method-level change-proneness from version-control metadata, links test cases to methods via analysis of the test-code call-graph, scores test-method associations with statistical measures (Average, Geometric Mean, etc.), and selects the highest-scoring tests. Evaluation on 15 open-source Java projects comprising 635 buggy versions reports average accuracy of 0.93 and fault detection rate of 0.94, with claims of outperforming class-level CP and similarity-based baselines while offering superior efficiency.

Significance. If the black-box property can be rigorously established and the reported performance gains hold under transparent methodology, MCTM would represent a practical advance in scalable test suite minimization for projects with version history, potentially reducing test execution costs without sacrificing fault detection. The use of real-world projects and a large number of buggy versions strengthens the empirical grounding relative to synthetic evaluations.

major comments (2)
  1. Abstract: The central premise that MCTM performs black-box TSM 'without analyzing production code' is placed in tension by the description of determining 'the dependency between test cases and methods by analyzing the test-code call-graph.' Static or dynamic construction of a call-graph that resolves production method signatures typically requires either source/bytecode access to the methods under test or execution traces exposing those signatures; this risks rendering the approach gray-box rather than black-box, which directly affects the claimed efficiency and scalability advantages over similarity-based methods that also consume execution data.
  2. Evaluation description (implied in abstract): The reported averages of 0.93 accuracy and 0.94 fault detection rate are presented without any indication of how these quantities are defined, how baselines were re-implemented, what data exclusions or parameter choices were applied, or whether statistical significance testing was performed across the 635 versions. Because these metrics are load-bearing for the superiority claim, their computation must be specified before the results can be assessed.
minor comments (2)
  1. The statistical measures used for scoring (Average, Geometric Mean, etc.) should be given explicit formulas or references in the method section to allow replication.
  2. Clarify whether the call-graph analysis is performed statically on test source only or requires dynamic execution, and state any assumptions about test-code structure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which help us improve the clarity and rigor of the manuscript. We address each major comment below and will incorporate revisions to strengthen the presentation of our black-box claims and evaluation methodology.

Point-by-point responses
  1. Referee: Abstract: The central premise that MCTM performs black-box TSM 'without analyzing production code' is placed in tension by the description of determining 'the dependency between test cases and methods by analyzing the test-code call-graph.' Static or dynamic construction of a call-graph that resolves production method signatures typically requires either source/bytecode access to the methods under test or execution traces exposing those signatures; this risks rendering the approach gray-box rather than black-box, which directly affects the claimed efficiency and scalability advantages over similarity-based methods that also consume execution data.

    Authors: We appreciate the referee highlighting this important distinction. MCTM derives change-proneness exclusively from version-control metadata (commit history) without any static or dynamic inspection of production code internals. The test-code call-graph analysis is performed only on the test sources to extract call sites and the method signatures they reference; no production code is loaded, parsed, or executed for the purpose of minimization. This is distinct from gray-box approaches that profile production execution or analyze production dependencies. To eliminate ambiguity, we will revise the abstract and Section 3 to include a precise definition of the black-box property in this context and explicitly state the boundaries of the call-graph analysis. revision: yes

  2. Referee: Evaluation description (implied in abstract): The reported averages of 0.93 accuracy and 0.94 fault detection rate are presented without any indication of how these quantities are defined, how baselines were re-implemented, what data exclusions or parameter choices were applied, or whether statistical significance testing was performed across the 635 versions. Because these metrics are load-bearing for the superiority claim, their computation must be specified before the results can be assessed.

    Authors: We agree that the abstract alone does not convey these details and that the evaluation section would benefit from greater transparency. In the full manuscript, accuracy is defined as the fraction of test cases correctly classified as fault-revealing or non-fault-revealing, and fault detection rate is the proportion of known bugs still detected by the minimized suite. Baselines were re-implemented following the original papers' descriptions, using the same 15 projects and 635 buggy versions; we applied no data exclusions beyond requiring sufficient version history for change-proneness computation. We will add a new subsection (e.g., 4.3) that formally defines both metrics, documents re-implementation choices and parameter settings, lists any filtering criteria, and reports statistical significance results (Wilcoxon signed-rank test with p-values) comparing MCTM against class-level CP and similarity baselines across all versions. revision: yes
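The two metric definitions given in the simulated response can be written down directly. The sketch below is one possible reading, not the paper's confirmed formulas: it assumes that selecting a test counts as predicting it fault-revealing, and all test and bug identifiers are hypothetical.

```python
def accuracy(selected, all_tests, fault_revealing):
    """Fraction of tests whose keep/drop decision matches their label.

    Assumption: a kept test is treated as "predicted fault-revealing"
    and a dropped test as "predicted non-fault-revealing".
    """
    correct = sum((t in selected) == (t in fault_revealing) for t in all_tests)
    return correct / len(all_tests)


def fault_detection_rate(selected, bug_to_tests):
    """Proportion of known bugs still detected by the reduced suite.

    bug_to_tests: bug id -> tests known to fail on that bug.
    A bug counts as detected if at least one such test was kept.
    """
    detected = sum(1 for tests in bug_to_tests.values() if selected & set(tests))
    return detected / len(bug_to_tests)


# Hypothetical data, for illustration only.
all_tests = {"t1", "t2", "t3", "t4"}
selected = {"t1", "t2"}
fault_revealing = {"t1", "t3"}
bug_to_tests = {"bug1": ["t1"], "bug2": ["t3"], "bug3": ["t2", "t3"], "bug4": ["t4"]}

acc = accuracy(selected, all_tests, fault_revealing)
fdr = fault_detection_rate(selected, bug_to_tests)
```

Under this reading the two numbers can diverge: accuracy penalizes every kept test that reveals no fault, while fault detection rate only asks whether each bug keeps at least one detecting test.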

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper describes an empirical pipeline for MCTM: CP scores are computed directly from external version-control metadata, dependencies are extracted via test-code call-graph analysis, and association scores are computed with standard statistical aggregates (Average, Geometric Mean) before selecting top-ranked tests. No equations, fitted parameters, or self-referential definitions appear in which an output is forced to equal an input by construction. Evaluation is performed on 15 independent open-source projects with real buggy versions, rendering the central claims externally falsifiable rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard software-engineering assumptions about call graphs and historical change data rather than new free parameters or invented entities.

axioms (2)
  • domain assumption: Test-code call-graph analysis accurately identifies the methods invoked by each test case.
    This premise is required to compute the association scores between tests and change-prone methods.
  • domain assumption: Method-level change-proneness derived from version control metadata serves as a reliable proxy for fault-revealing potential.
    This is the core metric used to rank and select test cases.

pith-pipeline@v0.9.0 · 5749 in / 1332 out tokens · 66457 ms · 2026-05-19T17:13:45.133174+00:00 · methodology


