
arxiv: 2605.05564 · v1 · submitted 2026-05-07 · 💻 cs.SE

Is this Build Failure Related to my Patch? An Empirical Study of Unrelated Build Failures in Continuous Integration

Pith reviewed 2026-05-08 09:22 UTC · model grok-4.3

classification 💻 cs.SE
keywords continuous integration · build failures · unrelated failures · PU learning · semi-supervised learning · empirical study · Apache projects · feature importance

The pith

Semi-supervised learning models predict whether a CI build failure is unrelated to the triggering code change.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines cases in continuous integration where a build fails for reasons other than the recent code push that triggered it. Developers currently spend a median of four hours determining if each failure requires action on their change. The authors analyze thousands of failures across seven Apache projects and sample hundreds of confirmed unrelated cases to understand patterns, including that unrelated test failures account for twenty percent of developer-classified unrelated builds. They extract thirty-three features from issue reports, comments, and commits, then train semi-supervised Positive and Unlabeled models to predict unrelated failures. If effective, the models would let developers and CI tools quickly set aside failures that do not need investigation of the current patch.

Core claim

The authors extract 33 features from issue reports, issue comments, and commits associated with the triggering push. They build semi-supervised Positive and Unlabeled learning models for each of seven Apache projects. These models predict unrelated build failures and achieve precision from 0.70 to 0.88, recall from 0.30 to 1.00, F1-score from 0.44 to 0.91, and AUC from 0.63 to 0.97. Feature importance analysis identifies CI latency, repeated error messages, and the number of preceding comments as useful indicators.
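As a reminder of how the three threshold-based metrics above relate, here is a minimal sketch that computes precision, recall, and F1 from confusion counts, treating "unrelated build failure" as the positive class. The function name and the counts are hypothetical and are not taken from the paper.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for the positive class.

    Here the positive class is "unrelated build failure".
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, chosen only for illustration.
p, r, f1 = precision_recall_f1(tp=30, fp=10, fn=70)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
# prints: precision=0.75 recall=0.30 F1=0.43
```

Because F1 is a harmonic mean, it stays close to the smaller of the two inputs, which is why low recall pulls F1 down even when precision is high.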

What carries the argument

Semi-supervised Positive and Unlabeled (PU) learning models trained on 33 features drawn from issue reports, comments, and commits.
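This page does not say which PU algorithm the authors use. As one illustration of the general idea, the sketch below shows the label-frequency correction from Elkan and Noto's classic PU approach: a classifier is first trained to separate labeled positives from unlabeled examples, and its scores are then rescaled by an estimate of c = P(labeled | truly positive). It assumes labeled positives are selected completely at random from all positives. The function names and scores are hypothetical.

```python
from statistics import mean


def estimate_label_frequency(scores_on_known_positives: list[float]) -> float:
    """Estimate c = P(labeled | truly positive) as the mean score that a
    labeled-vs-unlabeled classifier gives to held-out labeled positives."""
    if not scores_on_known_positives:
        raise ValueError("need at least one held-out labeled positive")
    return mean(scores_on_known_positives)


def positive_probability(score: float, c: float) -> float:
    """Convert P(labeled | x) into an estimate of P(positive | x) = score / c,
    clipped to 1.0 because the ratio can overshoot."""
    return min(1.0, score / c)


# Hypothetical scores from a classifier trained to separate labeled
# (confirmed unrelated) failures from unlabeled ones.
held_out_positive_scores = [0.5, 0.25, 0.75]
c = estimate_label_frequency(held_out_positive_scores)
for raw in (0.1, 0.3, 0.6):
    print(raw, "->", positive_probability(raw, c))
```

The point of the rescaling is that an unlabeled failure with a modest raw score can still be a likely positive when only a fraction of true positives were ever labeled.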

If this is right

  • Developers could receive automatic signals that a failure is unlikely to stem from their patch and skip unnecessary debugging.
  • CI pipelines could prioritize or route only probable related failures for immediate attention.
  • Repeated error messages and build latency emerge as practical signals that teams can monitor without full model retraining.
  • The models demonstrate that partial labeling of data suffices for useful prediction across multiple projects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same approach might be combined with automated root-cause tools to suggest the actual source of an unrelated failure.
  • Teams could use early predictions to pause or reconfigure flaky test suites before full builds complete.
  • Feature sets focused on timing and repetition could transfer to other environments where build noise is common.

Load-bearing premise

The 33 features from issue reports, comments, and commits plus the sampled labeling of unrelated failures are representative enough to train models that generalize across builds and projects.

What would settle it

Applying the trained models to a new collection of build failures from the same Apache projects and measuring whether precision, recall, and AUC stay within the reported ranges or drop sharply.
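A minimal sketch of that check follows, assuming metrics for the new collection have already been computed. The ranges are the ones reported in the abstract; the function name, the tolerance parameter, and the example values are hypothetical.

```python
# Ranges reported in the paper's abstract (min, max across the seven projects).
REPORTED_RANGES = {
    "precision": (0.70, 0.88),
    "recall": (0.30, 1.00),
    "f1": (0.44, 0.91),
    "auc": (0.63, 0.97),
}


def metrics_below_reported(metrics: dict[str, float], tolerance: float = 0.0) -> dict[str, float]:
    """Return the metrics that fall below the reported minimum by more than `tolerance`."""
    below = {}
    for name, value in metrics.items():
        low, _high = REPORTED_RANGES[name]
        if value < low - tolerance:
            below[name] = value
    return below


# Hypothetical results on a fresh set of build failures.
new_run = {"precision": 0.74, "recall": 0.22, "f1": 0.34, "auc": 0.66}
print(metrics_below_reported(new_run))
# prints: {'recall': 0.22, 'f1': 0.34}
```

What counts as a "sharp" drop is a judgment call, which is why the tolerance is left as a parameter rather than fixed.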

Figures

Figures reproduced from arXiv: 2605.05564 by Andie Huang, Daniel Alencar da Costa, Grant Dick, Mariam El Mezouar.

Figure 1: An example of a CI bot comment in a historical issue report after the comple…
Figure 2: Approach of Heuristic-based Labeling (HL)
Figure 3: The overview of the process in our study.
Figure 4: Time misspent on identifying unrelated build failures
Figure 5: An overview of the document analysis process
Figure 6: Dendrogram of hierarchical clustering based on Spearman correlation coef…
Figure 7: The Process Flow for Constructing the P, Q, and N Datasets.
Figure 8: The distribution of representative samples (371) across each theme and…
Figure 9: Performance metrics of the four selected models across the seven studied…
Figure 10: Example JIRA Issue Report showing how Priority, Is Blocker and Is Dependened and Number of Parallel Issues are extracted. Number of Parallel Issues: for each issue report, we calculate the number of issues that were opened on the same day, based on the report's creation date, and define this as the number of parallel opened issues. Is Cross Projects: As shown in…
Figure 11: Example of Calculating the Number of Prior Comments
Figure 12: Example of Retrieving the Failed Classes from Build Logs
Figure 13: Example of Retrieving the Has Code Patch and CI Latency
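The Figure 10 caption defines the Number of Parallel Issues feature as the number of issues opened on the same day as a given report. A minimal sketch of that computation follows. The issue keys and timestamps are hypothetical, and two details are assumptions: the count includes the report itself, and timestamps are ISO 8601 strings compared by calendar date without time zone handling.

```python
from collections import Counter
from datetime import datetime


def parallel_issue_counts(created_at: dict[str, str]) -> dict[str, int]:
    """Map each issue key to the number of issues created on the same calendar day.

    Assumption: the count includes the issue itself; subtract 1 for an
    "other issues only" convention.
    """
    day_of = {key: datetime.fromisoformat(ts).date() for key, ts in created_at.items()}
    per_day = Counter(day_of.values())
    return {key: per_day[day] for key, day in day_of.items()}


# Hypothetical issue keys and creation timestamps.
issues = {
    "PROJ-1": "2024-03-01T09:15:00",
    "PROJ-2": "2024-03-01T17:40:00",
    "PROJ-3": "2024-03-02T08:05:00",
}
print(parallel_issue_counts(issues))
# prints: {'PROJ-1': 2, 'PROJ-2': 2, 'PROJ-3': 1}
```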

Continuous Integration (CI) systems often run many builds concurrently. In this setting, a legitimate build failure may not be caused by the code push that triggered it. Such unrelated build failures can waste developer effort because developers must determine whether the failure is actionable for their current change. We study 77,354 CI build failures from seven open source Apache projects to understand and predict unrelated build failures. We find that developers spend a median of 4 hours identifying whether a failure is related or unrelated to their push. We also perform a document analysis of 371 confirmed unrelated build failures sampled from 10,316 potentially unrelated failures. The analysis shows that unrelated test failures account for 20% of the cases in which developers classify build failures as unrelated. To predict unrelated build failures, we extract 33 features from issue reports, issue comments, and commits associated with the triggering push. We build semi-supervised Positive and Unlabeled (PU) learning models for seven Apache projects. The models achieve precision from 0.70 to 0.88, recall from 0.30 to 1.00, F1-score from 0.44 to 0.91, and AUC from 0.63 to 0.97. Feature importance analysis shows that CI latency, repeated error messages, and the number of preceding comments are useful indicators of unrelated build failures. These results show that PU learning can help developers identify build failures that are unlikely to be caused by their current push.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper empirically studies unrelated build failures in CI systems across seven Apache projects, analyzing 77,354 build failures to show that developers spend a median of 4 hours determining relatedness. It performs document analysis on 371 sampled unrelated failures (from 10,316 candidates), finding that unrelated test failures comprise 20% of cases. Using 33 features from issue reports, comments, and commits, it trains per-project semi-supervised PU learning models that achieve precision 0.70-0.88, recall 0.30-1.00, F1 0.44-0.91, and AUC 0.63-0.97, with feature importance analysis identifying CI latency, repeated error messages, and preceding comments as useful predictors.

Significance. If the reported model performance holds, the work offers a practical way to reduce developer time wasted on non-actionable CI failures by using observable artifacts and PU learning to cope with limited labels. Strengths include concrete metrics on real Apache project data, identification of actionable features, and a semi-supervised treatment of a common CI pain point. The per-project evaluation and variance analysis provide a starting point for tool support, though broader impact depends on addressing generalizability.

major comments (3)
  1. [Evaluation] The evaluation of the PU learning models reports recall ranging from 0.30 to 1.00 across projects (with F1 as low as 0.44). This variance indicates that the fixed 33-feature set may fail to capture project-specific failure patterns, especially since models are trained and evaluated separately per project without any cross-project transfer or held-out validation experiments.
  2. [Methodology] The sampling of 371 confirmed unrelated build failures from 10,316 potentially unrelated ones, combined with the PU learning setup, requires explicit details on labeling criteria, inter-rater agreement, and validation of PU assumptions (e.g., the positive-unlabeled distribution). Without these, the representativeness of the labeled set and potential sampling bias cannot be assessed, directly affecting the reliability of the reported metrics.
  3. [Results] The central claim that the approach can help developers identify unrelated failures rests on per-project models; however, the absence of cross-project generalization tests means the headline performance numbers do not establish applicability to new projects or varying CI conditions, as noted by the wide metric ranges.
minor comments (1)
  1. [Abstract] The abstract mentions the 33 features and 371 cases but provides limited transparency on the exact feature extraction process or sampling strategy; expanding this would aid reproducibility.
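The second major comment asks for inter-rater agreement on the labeled cases. One common statistic for two raters is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is an illustration only: this page does not say which agreement measure, if any, the authors computed, and the labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two raters on four build failures.
a = ["unrelated", "unrelated", "related", "related"]
b = ["unrelated", "related", "related", "related"]
print(cohens_kappa(a, b))
# prints: 0.5
```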

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, clarifying our approach where appropriate and indicating planned revisions to improve clarity and completeness.

  1. Referee: [Evaluation] The evaluation of the PU learning models reports recall ranging from 0.30 to 1.00 across projects (with F1 as low as 0.44). This variance indicates that the fixed 33-feature set may fail to capture project-specific failure patterns, especially since models are trained and evaluated separately per project without any cross-project transfer or held-out validation experiments.

    Authors: We agree that the observed variance in recall and F1 scores reflects real differences in project-specific CI failure patterns, which motivated our per-project modeling strategy rather than a single global model. The 33 features were selected as commonly available signals across Apache projects to enable practical application. We will revise the evaluation section to include an explicit discussion of this variance as a key finding and its implications. Additionally, we will add a cross-project leave-one-out experiment to quantify transfer performance and report it in the revised manuscript. revision: yes

  2. Referee: [Methodology] The sampling of 371 confirmed unrelated build failures from 10,316 potentially unrelated ones, combined with the PU learning setup, requires explicit details on labeling criteria, inter-rater agreement, and validation of PU assumptions (e.g., the positive-unlabeled distribution). Without these, the representativeness of the labeled set and potential sampling bias cannot be assessed, directly affecting the reliability of the reported metrics.

    Authors: We will expand the methodology and document analysis sections to provide explicit labeling criteria, describing how unrelatedness was determined from issue reports, comments, and commit context for the sampled failures. We will also report the author review process used for the 371 cases and any agreement measures obtained. For the PU learning setup, we will add a dedicated subsection validating the assumptions by discussing the selection of positives from the 10,316 candidates and the nature of the unlabeled set, following established PU learning practices. These additions will allow readers to better assess representativeness and bias. revision: yes

  3. Referee: [Results] The central claim that the approach can help developers identify unrelated failures rests on per-project models; however, the absence of cross-project generalization tests means the headline performance numbers do not establish applicability to new projects or varying CI conditions, as noted by the wide metric ranges.

    Authors: We acknowledge that the wide metric ranges (particularly recall 0.30-1.00) indicate limited direct generalizability, and our claims are scoped to the seven studied projects where per-project models can be trained on historical data. The central contribution is demonstrating that PU learning on observable CI artifacts can reduce wasted effort within such projects. We will revise the results and threats-to-validity sections to more prominently state this scope as a limitation and to discuss the conditions under which the approach is expected to apply. No claim of universal applicability is made in the current manuscript. revision: yes
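The cross-project leave-one-out experiment mentioned in the first response could be organized roughly as follows: each project is held out once as the test set while the remaining six supply training data. This is a sketch of the split logic only, with placeholder project names, and it is not the authors' experimental code.

```python
from typing import Iterator


def leave_one_project_out(projects: list[str]) -> Iterator[tuple[list[str], str]]:
    """Yield (training_projects, held_out_project) for every project in turn."""
    for held_out in projects:
        train = [p for p in projects if p != held_out]
        yield train, held_out


# Placeholder names standing in for the seven studied Apache projects.
projects = [f"project_{i}" for i in range(1, 8)]
for train, test in leave_one_project_out(projects):
    print(f"train on {len(train)} projects, test on {test}")
```

Comparing each held-out score with the corresponding within-project score would show how much performance is lost when no project-specific history is available.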

Circularity Check

0 steps flagged

No circularity: empirical PU models report performance on collected CI data without tautological reduction.


The paper collects 77,354 build failures, manually confirms 371 unrelated cases sampled from 10,316 potentially unrelated failures, extracts 33 observable features from issues, commits, and comments, and trains per-project PU classifiers whose precision, recall, F1, and AUC values are measured directly on that labeled data. No derivation step equates a claimed prediction to its own fitted inputs by construction, no uniqueness theorem or ansatz is smuggled in via self-citation, and the central results remain independent empirical measurements rather than definitional restatements. The observed metric variance reflects data characteristics, not circular logic.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on assumptions about data representativeness and feature relevance typical in empirical ML studies for software engineering.

free parameters (1)
  • PU learning model hyperparameters
    Semi-supervised models require tuning parameters that are fitted to the project data.
axioms (1)
  • domain assumption: The sampled 371 unrelated failures and 10,316 potentially unrelated cases are representative of all CI build failures.
    Invoked in the document analysis and model training sections.



Reference graph

Works this paper leans on

92 extracted references · 92 canonical work pages

  1. [1]

    Empirical analysis of practitioners’ perceptions of test flakiness factors

    AHMAD, A., LEIFLER, O.,ANDSANDAHL, K. Empirical analysis of practitioners’ perceptions of test flakiness factors. Software Testing, Verificationand Reliability 31, 8 (2021), e1791

  2. [2]

    A., COGO, F

    AJIBODE, A., BANGASH, A. A., COGO, F. R., ADAMS, B.,ANDHASSAN, A. E. Towards se- mantic versioning of open pre-trained language model releases on hugging face. Empirical Software Engineering 30, 3 (2025), 1–63

  3. [3]

    Continuous integration and continuous delivery pipeline automa- tion for agile software project management

    ARACHCHI, S.,ANDPERERA, I. Continuous integration and continuous delivery pipeline automa- tion for agile software project management. In 2018 Moratuwa Engineering Research Conference (MERCon) (2018), IEEE, pp. 156–161

  4. [4]

    Deflaker: Automatically detecting flaky tests

    BELL, J., LEGUNSEN, O., HILTON, M., ELOUSSI, L., YUNG, T.,ANDMARINOV, D. Deflaker: Automatically detecting flaky tests. In 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE) (2018), IEEE, pp. 433–444

  5. [5]

    H.,DACOSTA, D

    BERNARDO, J. H.,DACOSTA, D. A.,ANDKULESZA, U. Studying the impact of adopting con- tinuous integration on the delivery time of pull requests. In Proceedings of the 15th International Conference on Mining Software Repositories (2018), pp. 131–141

  6. [6]

    BOWEN, G. A. Document analysis as a qualitative research method. Qualitative research journal (2009)

  7. [7]

    Random forests

    BREIMAN, L. Random forests. Machine learning 45, 1 (2001), 5–32

  8. [8]

    Buildfast: History-aware build outcome prediction for fast feedback and reduced cost in continuous integration

    CHEN, B., CHEN, L., ZHANG, C.,ANDPENG, X. Buildfast: History-aware build outcome prediction for fast feedback and reduced cost in continuous integration. In Proceedings of the 35th IEEE/ACM international conference on automated software engineering (2020), pp. 42–53

  9. [9]

    K.,ANDAKSEL, G

    C ¸ORBACIO ˘GLU, S ¸ . K.,ANDAKSEL, G. Receiver operating characteristic curve analysis in diag- nostic accuracy studies: A guide to interpreting the area under the curve value. Turkish Journal of Emergency Medicine 23, 4 (2023), 195

  10. [10]

    P., MISAROS, M., GOTA, D.,ANDMICLEA, L

    DONCA, I.-C., STAN, O. P., MISAROS, M., GOTA, D.,ANDMICLEA, L. Method for continuous integration and deployment using a pipeline generator for agile software projects. Sensors 22, 12 (2022), 4637

  11. [11]

    M., MATYAS, S.,ANDGLOVER, A

    DUVALL, P. M., MATYAS, S.,ANDGLOVER, A. Continuous integration: improving software quality and reducing risk. Pearson Education, 2007

  12. [12]

    Strength of evidence in systematic reviews in software engineering

    DYB ˚A, T.,ANDDINGSØYR, T. Strength of evidence in systematic reviews in software engineering. In Proceedings of the Second ACM-IEEE international symposium on Empirical software engineering and measurement (2008), pp. 178–187

  13. [13]

    Understanding flaky tests: The developer’s perspective

    ECK, M., PALOMBA, F., CASTELLUCCIO, M.,ANDBACCHELLI, A. Understanding flaky tests: The developer’s perspective. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (2019), pp. 830–840

  14. [14]

    E.,ANDZOU, Y

    EHSAN, O., HASSAN, S., MEZOUAR, M. E.,ANDZOU, Y. An empirical study of developer discus- sions in the gitter platform. ACM Transactions on Software Engineering and Methodology (TOSEM) 30, 1 (2020), 1–39. 38 Yonghui (Andie) Huang · Daniel Alencar da Costa · Grant Dick · Mariam El Mezouar

  15. [15]

    A., GERMAN, D

    ELMEZOUAR, M.,DACOSTA, D. A., GERMAN, D. M.,ANDZOU, Y. Exploring the use of chatrooms by developers: An empirical study on slack and gitter. IEEE Transactions on Software Engineering 48, 10 (2021), 3988–4001

  16. [16]

    S., LOWLIND, D., ERNST, N

    ELAZHARY, O., WERNER, C., LI, Z. S., LOWLIND, D., ERNST, N. A.,ANDSTOREY, M.-A. Uncovering the benefits and challenges of continuous integration practices. IEEE Transactions on Software Engineering 48, 7 (2021), 2570–2583

  17. [17]

    Techniques for improving regression testing in continuous integration development environments

    ELBAUM, S., ROTHERMEL, G.,ANDPENIX, J. Techniques for improving regression testing in continuous integration development environments. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (2014), pp. 235–245

  18. [18]

    Learning classifiers from only positive and unlabeled data

    ELKAN, C.,ANDNOTO, K. Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining (2008), pp. 213–220

  19. [19]

    Determining flaky tests from test failures

    ELOUSSI, L. Determining flaky tests from test failures

  20. [20]

    A., CARTAXO, B.,ANDPINTO, G

    FELIDR ´E, W., FURTADO, L.,DACOSTA, D. A., CARTAXO, B.,ANDPINTO, G. Continuous in- tegration theater. In 2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM) (2019), IEEE, pp. 1–10

  21. [21]

    Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement

    FORMAN, G.,ANDSCHOLZ, M. Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement. Acm Sigkdd Explorations Newsletter 12, 1 (2010), 49–57

  22. [22]

    Continuous integration, 2006

    FOWLER, M.,ANDFOEMMEL, M. Continuous integration, 2006

  23. [23]

    H., MONTES-YG ´OMEZ, M., ROSSO, P.,ANDCABRERA, R

    FUSILIER, D. H., MONTES-YG ´OMEZ, M., ROSSO, P.,ANDCABRERA, R. G. Detecting positive and negative deceptive opinions using pu-learning. Information processing & management 51, 4 (2015), 433–443

  24. [24]

    Improving the robustness and efficiency of continuous integration and deployment

    GALLABA, K. Improving the robustness and efficiency of continuous integration and deployment. In 2019 IEEE International Conference on Software Maintenance and Evolution (ICSME) (2019), IEEE, pp. 619–623

  25. [25]

    A., DACOSTA, D

    GHALEB, T. A., DACOSTA, D. A.,ANDZOU, Y. An empirical study of the long duration of continuous integration builds. Empirical Software Engineering 24, 4 (2019), 2102–2139

  26. [26]

    A.,DACOSTA, D

    GHALEB, T. A.,DACOSTA, D. A., ZOU, Y.,ANDHASSAN, A. E. Studying the impact of noises in build breakage data. IEEE Transactions on Software Engineering 47, 9 (2019), 1998–2011

  27. [27]

    A., HASSAN, S.,ANDZOU, Y

    GHALEB, T. A., HASSAN, S.,ANDZOU, Y. Studying the interplay between the durations and breakages of continuous integration builds. IEEE Transactions on Software Engineering 49, 4 (2022), 2476–2497

  28. [28]

    An exploratory study of the pull-based software development model

    GOUSIOS, G., PINZGER, M.,ANDDEURSEN, A.V. An exploratory study of the pull-based software development model. In Proceedings of the 36th international conference on software engineering (2014), pp. 345–355

  29. [29]

    Are there socioeconomic differentials in under-reporting of smoking in pregnancy? Tobacco Control 12, 4 (2003), 434–434

    GRAHAM, H.,ANDOWEN, L. Are there socioeconomic differentials in under-reporting of smoking in pregnancy? Tobacco Control 12, 4 (2003), 434–434

  30. [30]

    A.,ANDMCNEIL, B

    HANLEY, J. A.,ANDMCNEIL, B. J. The meaning and use of the area under a receiver operating characteristic (roc) curve. Radiology 143, 1 (1982), 29–36

  31. [31]

    Tackling build failures in continuous integration

    HASSAN, F. Tackling build failures in continuous integration. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE) (2019), IEEE, pp. 1242–1245

  32. [32]

    A comparative study to benchmark cross- project defect prediction approaches

    HERBOLD, S., TRAUTSCH, A.,ANDGRABOWSKI, J. A comparative study to benchmark cross- project defect prediction approaches. In Proceedings of the 40th international conference on software engineering (2018), pp. 1063–1063

  33. [33]

    Motivation of software developers in open source projects: an internet-based survey of contributors to the linux kernel

    HERTEL, G., NIEDNER, S.,ANDHERRMANN, S. Motivation of software developers in open source projects: an internet-based survey of contributors to the linux kernel. Research policy 32, 7 (2003), 1159–1177

  34. [34]

    Pu learning for matrix completion

    HSIEH, C.-J., NATARAJAN, N.,ANDDHILLON, I. Pu learning for matrix completion. In International conference on machine learning (2015), PMLR, pp. 2445–2453

  35. [35]

    unrelated-build-failures-empirical-studies, 2024.https://github.com/ckeys/ unrelated-build-failures-empirical-studies(Accessed: 2024-11-26)

    HUANG, A. unrelated-build-failures-empirical-studies, 2024.https://github.com/ckeys/ unrelated-build-failures-empirical-studies(Accessed: 2024-11-26)

  36. [36]

    A., ZHANG, F.,ANDZOU, Y

    HUANG, Y.,DACOSTA, D. A., ZHANG, F.,ANDZOU, Y. An empirical study on the issue reports with questions raised during the issue resolving process.Empirical Software Engineering 24, 2 (2019), 718–750

  37. [37]

    Evaluating learning algorithms: a classification perspective

    JAPKOWICZ, N.,ANDSHAH, M. Evaluating learning algorithms: a classification perspective. Cam- bridge University Press, 2011

  38. [38]

    An ex- tended study of syntactic breaking changes in the wild

    JAYASURIYA, D., OU, S., HEGDE, S., TERRAGNI, V., DIETRICH, J.,ANDBLINCOE, K. An ex- tended study of syntactic breaking changes in the wild. Empirical Software Engineering 30, 2 (2025), 1–45. Title Suppressed Due to Excessive Length 39

  39. [39]

    The impact of automated feature selection techniques on the interpretation of defect models

    JIARPAKDEE, J., TANTITHAMTHAVORN, C.,ANDTREUDE, C. The impact of automated feature selection techniques on the interpretation of defect models. Empirical Software Engineering 25, 5 (2020), 3590–3638

  40. [40]

    A cost-efficient approach to building in continuous integration

    JIN, X.,ANDSERVANT, F. A cost-efficient approach to building in continuous integration. In Proceedings of the ACM/IEEE 42nd International conference on software engineering (2020), pp. 13– 25

  41. [41]

    A systematic review of systematic review process research in software engineering

    KITCHENHAM, B.,ANDBRERETON, P. A systematic review of systematic review process research in software engineering. Information and software technology 55, 12 (2013), 2049–2075

  42. [42]

    Support- ing continuous integration by code-churn based test selection

    KNAUSS, E., STARON, M., MEDING, W., S ¨ODER, O., NILSSON, A.,ANDCASTELL, M. Support- ing continuous integration by code-churn based test selection. In 2015 IEEE/ACM 2nd International Workshop on Rapid Continuous Software Engineering (2015), IEEE, pp. 19–25

  43. [43]

    Measuring the cost of regression testing in practice: A study of java projects using continuous integration

    LABUSCHAGNE, A., INOZEMTSEVA, L.,ANDHOLMES, R. Measuring the cost of regression testing in practice: A study of java projects using continuous integration. In Proceedings of the 2017 11th joint meeting on foundations of software engineering (2017), pp. 821–830

  44. [44]

    When life gives you oranges: detecting and diagnosing intermittent job failures at mozilla

    LAMPEL, J., JUST, S., APEL, S.,ANDZELLER, A. When life gives you oranges: detecting and diagnosing intermittent job failures at mozilla. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (2021), pp. 1381–1392

  45. [45]

    Collaboration tools for global software engineering

    LANUBILE, F., EBERT, C., PRIKLADNICKI, R.,ANDVIZCA ´INO, A. Collaboration tools for global software engineering. IEEE software 27, 2 (2010), 52

  46. [46]

    Random forests

    LEO,ANDBREIMAN. Random forests. Machine Learning (2001)

  47. [47]

    Weighted reward for reinforcement learning based test case prioritization in continuous integration testing

    LI, G., YANG, Y., WU, Z., CAO, T., LIU, Y.,ANDLI, Z. Weighted reward for reinforcement learning based test case prioritization in continuous integration testing. In 2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC) (2021), IEEE, pp. 980–985

  48. [48]

    A positive and unlabeled learning algorithm for one-class clas- sification of remote-sensing data

    LI, W., GUO, Q.,ANDELKAN, C. A positive and unlabeled learning algorithm for one-class clas- sification of remote-sensing data. IEEE transactions on geoscience and remote sensing 49, 2 (2010), 717–725

  49. [49]

    Learning to classify texts using positive and unlabeled data

    LI, X.,ANDLIU, B. Learning to classify texts using positive and unlabeled data. In IJCAI (2003), vol. 3, Citeseer, pp. 587–592

  50. [50]

    X., AIKEN, A.,ANDJORDAN, M

    LIBLIT, B., NAIK, M., ZHENG, A. X., AIKEN, A.,ANDJORDAN, M. I. Scalable statistical bug isolation. Acm Sigplan Notices 40, 6 (2005), 15–26

  51. [51]

    S., YU, P

    LIU, B., LEE, W. S., YU, P. S.,ANDLI, X. Partially supervised classification of text documents. In ICML (2002), vol. 2, Sydney, NSW, pp. 387–394

  52. [52]

    M., BOURAFFA, A.,ANDMAALEJ, W

    L ¨UDERS, C. M., BOURAFFA, A.,ANDMAALEJ, W. Beyond duplicates: Towards understanding and predicting link types in issue tracking systems. In Proceedings of the 19th International Conference on Mining Software Repositories (2022), pp. 48–60

  53. [53]

    M., PIETZ, T.,ANDMAALEJ, W

    L ¨UDERS, C. M., PIETZ, T.,ANDMAALEJ, W. Automated detection of typed links in issue trackers. In 2022 IEEE 30th International Requirements Engineering Conference (RE) (2022), IEEE, pp. 26– 38

  54. [54]

    M., RAATIKAINEN, M., MOTGER, J.,ANDMAALEJ, W

    L ¨UDERS, C. M., RAATIKAINEN, M., MOTGER, J.,ANDMAALEJ, W. Openreq issue link map: A tool to visualize issue links in jira. In 2019 IEEE 27th International Requirements Engineering Conference (RE) (2019), IEEE, pp. 492–493

  55. [55]

    An empirical analysis of flaky tests

    LUO, Q., HARIRI, F., ELOUSSI, L.,ANDMARINOV, D. An empirical analysis of flaky tests. In Proceedings of the 22nd ACM SIGSOFT international symposium on foundations of software engineering (2014), pp. 643–653

  56. [56]

    Predictive test selection

    MACHALICA, M., SAMYLKIN, A., PORTH, M.,ANDCHANDRA, S. Predictive test selection. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP) (2019), IEEE, pp. 91–100

  57. [57]

    Are fix-inducing changes a moving target? a longitudinal case study of just-in-time defect prediction

    MCINTOSH, S.,ANDKAMEI, Y. Are fix-inducing changes a moving target? a longitudinal case study of just-in-time defect prediction. In Proceedings of the 40th International Conference on Software Engineering (2018), pp. 560–560

  58. [58]

    MCINTOSH, S., KAMEI, Y., ADAMS, B.,ANDHASSAN, A. E. An empirical study of the impact of modern code review practices on software quality. Empirical Software Engineering 21, 5 (2016), 2146–2189

  59. [59]

    Continuous integration and its tools

    MEYER, M. Continuous integration and its tools. IEEE software 31, 3 (2014), 14–16

  60. [60]

    N., LI, X.-L.,ANDNG, S.-K

    NGUYEN, M. N., LI, X.-L.,ANDNG, S.-K. Positive unlabeled learning for time series classifica- tion. In Twenty-Second International Joint Conference on Artificial Intelligence (2011), Citeseer. 40 Yonghui (Andie) Huang · Daniel Alencar da Costa · Grant Dick · Mariam El Mezouar

  61. [61]

    Towards language-independent brown build detection

    OLEWICKI, D., NAYROLLES, M.,ANDADAMS, B. Towards language-independent brown build detection. In Proceedings of the 44th International Conference on Software Engineering (2022), pp. 2177–2188

  62. [62]

    PALOMBA, F.,ANDZAIDMAN, A. Notice of retraction: Does refactoring of test smells induce fixing flaky tests? In 2017 IEEE international conference on software maintenance and evolution (ICSME) (2017), IEEE, pp. 1–12

  63. [63]

    Continuous test suite failure prediction

    PAN, C.,ANDPRADEL, M. Continuous test suite failure prediction. In Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis (2021), pp. 553–565

  64. [64]

    E., SIY, H

    PERRY, D. E., SIY, H. P.,ANDVOTTA, L. G. Parallel changes in large-scale software development: an observational case study.ACM Transactions on Software Engineering and Methodology (TOSEM) 10, 3 (2001), 308–337

  65. [65]

    The impact of using regression models to build defect classifiers

    RAJBAHADUR, G. K., WANG, S., KAMEI, Y., AND HASSAN, A. E. The impact of using regression models to build defect classifiers. In 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR) (2017), IEEE, pp. 135–145

  66. [66]

    An empirical analysis of build failures in the continuous integration workflows of java-based open-source software

    RAUSCH, T., HUMMER, W., LEITNER, P., AND SCHULTE, S. An empirical analysis of build failures in the continuous integration workflows of java-based open-source software. In 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR) (2017), IEEE, pp. 345–355

  67. [67]

    SAIDANI, I., OUNI, A., CHOUCHEN, M., AND MKAOUER, M. W. On the prediction of continuous integration build failures using search-based software engineering. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference Companion (2020), pp. 313–314

  68. [68]

    SAIDANI, I., OUNI, A., CHOUCHEN, M., AND MKAOUER, M. W. Predicting continuous integration build failures using evolutionary search. Information and Software Technology 128 (2020), 106392

  69. [69]

    Learning ci configuration correctness for early build feedback

    SANTOLUCITO, M., ZHANG, J., ZHAI, E., CITO, J., AND PISKAC, R. Learning ci configuration correctness for early build feedback. In 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) (2022), IEEE, pp. 1006–1017

  70. [70]

    Investigating the impact of continuous integration practices on the productivity and quality of open-source projects

    SANTOS, J., ALENCAR DA COSTA, D., AND KULESZA, U. Investigating the impact of continuous integration practices on the productivity and quality of open-source projects. In Proceedings of the 16th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (2022), pp. 137–147

  71. [71]

    On the need to monitor continuous integration practices

    SANTOS, J., DA COSTA, D. A., MCINTOSH, S., AND KULESZA, U. On the need to monitor continuous integration practices. Empirical Software Engineering 30, 5 (2025), 125

  72. [72]

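    Continuous integration, delivery and deployment: a systematic review on approaches, tools, challenges and practices

    SHAHIN, M., BABAR, M. A., AND ZHU, L. Continuous integration, delivery and deployment: a systematic review on approaches, tools, challenges and practices. IEEE Access 5 (2017), 3909–3943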

  73. [73]

    Understanding and improving regression test selection in continuous integration

    SHI, A., ZHAO, P., AND MARINOV, D. Understanding and improving regression test selection in continuous integration. In 2019 IEEE 30th International Symposium on Software Reliability Engineering (ISSRE) (2019), IEEE, pp. 228–238

  74. [74]

    Shake it! detecting flaky tests caused by concurrency with shaker

    SILVA, D., TEIXEIRA, L., AND D’AMORIM, M. Shake it! detecting flaky tests caused by concurrency with shaker. In 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME) (2020), IEEE, pp. 301–311

  75. [75]

    The effects of continuous integration on software development: a systematic literature review

    SOARES, E., SIZILIO, G., SANTOS, J., DA COSTA, D. A., AND KULESZA, U. The effects of continuous integration on software development: a systematic literature review. Empirical Software Engineering 27, 3 (2022), 78

  76. [76]

    The impact of class rebalancing techniques on the performance and interpretation of defect prediction models

    TANTITHAMTHAVORN, C., HASSAN, A. E., AND MATSUMOTO, K. The impact of class rebalancing techniques on the performance and interpretation of defect prediction models. IEEE Transactions on Software Engineering 46, 11 (2018), 1200–1219

  77. [77]

    Verifying the selected completely at random assumption in positive-unlabeled learning

    TEISSEYRE, P., FURMAŃCZYK, K., AND MIELNICZUK, J. Verifying the selected completely at random assumption in positive-unlabeled learning. arXiv preprint arXiv:2404.00145 (2024)

  78. [78]

    A container-based infrastructure for fuzzy-driven root causing of flaky tests

    TERRAGNI, V., SALZA, P., AND FERRUCCI, F. A container-based infrastructure for fuzzy-driven root causing of flaky tests. In 2020 IEEE/ACM 42nd International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER) (2020), IEEE, pp. 69–72

  79. [79]

    An empirical study of flaky tests in android apps

    THORVE, S., SRESHTHA, C., AND MENG, N. An empirical study of flaky tests in android apps. In 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME) (2018), IEEE, pp. 534–538

  80. [80]

    On the relative value of cross-company and within-company data for defect prediction

    TURHAN, B., MENZIES, T., BENER, A. B., AND DI STEFANO, J. On the relative value of cross-company and within-company data for defect prediction. Empirical Software Engineering 14, 5 (2009), 540–578

Showing first 80 references.