Is this Build Failure Related to my Patch? An Empirical Study of Unrelated Build Failures in Continuous Integration
Pith reviewed 2026-05-08 09:22 UTC · model grok-4.3
The pith
Semi-supervised learning models predict whether a CI build failure is unrelated to the triggering code change.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors extract 33 features from issue reports, issue comments, and commits associated with the triggering push. They build semi-supervised Positive and Unlabeled learning models for each of seven Apache projects. These models predict unrelated build failures and achieve precision from 0.70 to 0.88, recall from 0.30 to 1.00, F1-score from 0.44 to 0.91, and AUC from 0.63 to 0.97. Feature importance analysis identifies CI latency, repeated error messages, and the number of preceding comments as useful indicators.
What carries the argument
Semi-supervised Positive and Unlabeled (PU) learning models trained on 33 features drawn from issue reports, comments, and commits.
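To make the PU setting concrete, here is a minimal sketch of one widely used PU approach, the Elkan and Noto estimator. This page does not state which PU variant the authors used, so treat this as an illustration of the general idea rather than the paper's method. The single feature, the toy values, and the hand-rolled logistic regression are all assumptions made for brevity. The estimator also relies on the assumption that labeled positives are selected completely at random from all positives.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ss, lr=0.1, epochs=2000):
    """Fit g(x) = P(s=1 | x) on one feature with batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - s for x, s in zip(xs, ss)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b

# Toy data (hypothetical): x is a standardized feature such as CI latency.
# s = 1 marks failures confirmed as unrelated; s = 0 means "unlabeled",
# which mixes unrelated and related failures.
labeled_pos = [1.5, 2.0, 2.5, 3.0]
unlabeled = [1.8, 2.2, 2.7, 3.2, -3.0, -2.5, -2.0, -1.5, -2.2, -1.8, -2.7, -3.2]
xs = labeled_pos + unlabeled
ss = [1] * len(labeled_pos) + [0] * len(unlabeled)

w, b = fit_logistic(xs, ss)

def g(x):
    """Score of the 'labeled vs. unlabeled' classifier."""
    return sigmoid(w * x + b)

# Elkan-Noto step: c estimates P(s=1 | y=1), the labeling frequency,
# as the mean score over the labeled positives. A held-out split would
# normally be used here; the training set is reused to keep the sketch short.
c = sum(g(x) for x in labeled_pos) / len(labeled_pos)

def p_unrelated(x):
    """Adjusted estimate of P(y=1 | x) = g(x) / c, clipped to 1."""
    return min(1.0, g(x) / c)

print(round(c, 2), round(p_unrelated(2.5), 2), round(p_unrelated(-2.5), 2))
```

The point of the correction is that a classifier trained on "labeled vs. unlabeled" underestimates the probability of the positive class, because many positives sit in the unlabeled pool; dividing by c rescales the score.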
If this is right
- Developers could receive automatic signals that a failure is unlikely to stem from their patch and skip unnecessary debugging.
- CI pipelines could prioritize or route only probable related failures for immediate attention.
- Repeated error messages and build latency emerge as practical signals that teams can monitor without full model retraining.
- The models demonstrate that partial labeling of data suffices for useful prediction across multiple projects.
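As a rough illustration of how a team might monitor the repeated-error signal without a trained model, the sketch below counts recent failures from other pushes that share the same normalized error message. The `BuildFailure` record, the `normalize` rule, and the sample messages are hypothetical; this page does not give the paper's actual feature definitions.

```python
import re
from dataclasses import dataclass

@dataclass
class BuildFailure:
    # Hypothetical record layout, not taken from the paper.
    build_id: str
    push_id: str
    error_message: str

def normalize(message):
    """Lowercase and mask digits so messages differing only in numbers match."""
    return re.sub(r"\d+", "<n>", message.lower()).strip()

def repeated_error_count(current, recent):
    """Count recent failures from other pushes with the same normalized error."""
    key = normalize(current.error_message)
    return sum(
        1
        for failure in recent
        if failure.push_id != current.push_id
        and normalize(failure.error_message) == key
    )

recent = [
    BuildFailure("b1", "p1", "Timeout after 300s connecting to repo"),
    BuildFailure("b2", "p2", "timeout after 120s connecting to repo"),
    BuildFailure("b3", "p3", "NullPointerException in FooTest"),
]
current = BuildFailure("b4", "p4", "Timeout after 600s connecting to repo")

count = repeated_error_count(current, recent)
print(count)  # 2: builds b1 and b2 share the normalized message
```

A high count suggests the error predates the current push, which is the intuition behind treating repetition as a hint of unrelatedness.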
Where Pith is reading between the lines
- The same approach might be combined with automated root-cause tools to suggest the actual source of an unrelated failure.
- Teams could use early predictions to pause or reconfigure flaky test suites before full builds complete.
- Feature sets focused on timing and repetition could transfer to other environments where build noise is common.
Load-bearing premise
The 33 features from issue reports, comments, and commits plus the sampled labeling of unrelated failures are representative enough to train models that generalize across builds and projects.
What would settle it
Applying the trained models to a new collection of build failures from the same Apache projects and measuring whether precision, recall, and AUC stay within the reported ranges or drop sharply.
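A sketch of that check, assuming confusion counts are available for the new collection. The ranges are the ones reported on this page; the counts in the examples are invented for illustration. AUC is only checked if the caller supplies it, since it cannot be derived from confusion counts.

```python
# Ranges reported across the seven per-project models.
REPORTED_RANGES = {
    "precision": (0.70, 0.88),
    "recall": (0.30, 1.00),
    "f1": (0.44, 0.91),
    "auc": (0.63, 0.97),
}

def prf(tp, fp, fn):
    """Precision, recall, and F1 from true positives, false positives, false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

def outside_reported(metrics):
    """Names of supplied metrics that fall outside the reported ranges."""
    return [
        name
        for name, (low, high) in REPORTED_RANGES.items()
        if name in metrics and not (low <= metrics[name] <= high)
    ]

# Hypothetical counts for two new collections of failures.
stable = prf(tp=40, fp=10, fn=60)    # precision 0.80, recall 0.40
degraded = prf(tp=10, fp=30, fn=10)  # precision 0.25, recall 0.50

print(outside_reported(stable))    # []
print(outside_reported(degraded))  # ['precision', 'f1']
```

Because the reported ranges span all seven projects, staying inside them is a weak test; comparing against each project's own reported values would be stricter.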
Original abstract
Continuous Integration (CI) systems often run many builds concurrently. In this setting, a legitimate build failure may not be caused by the code push that triggered it. Such unrelated build failures can waste developer effort because developers must determine whether the failure is actionable for their current change. We study 77,354 CI build failures from seven open source Apache projects to understand and predict unrelated build failures. We find that developers spend a median of 4 hours identifying whether a failure is related or unrelated to their push. We also perform a document analysis of 371 confirmed unrelated build failures sampled from 10,316 potentially unrelated failures. The analysis shows that unrelated test failures account for 20% of the cases in which developers classify build failures as unrelated. To predict unrelated build failures, we extract 33 features from issue reports, issue comments, and commits associated with the triggering push. We build semi-supervised Positive and Unlabeled (PU) learning models for seven Apache projects. The models achieve precision from 0.70 to 0.88, recall from 0.30 to 1.00, F1-score from 0.44 to 0.91, and AUC from 0.63 to 0.97. Feature importance analysis shows that CI latency, repeated error messages, and the number of preceding comments are useful indicators of unrelated build failures. These results show that PU learning can help developers identify build failures that are unlikely to be caused by their current push.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper empirically studies unrelated build failures in CI systems across seven Apache projects, analyzing 77,354 build failures to show that developers spend a median of 4 hours determining relatedness. It performs document analysis on 371 sampled unrelated failures (from 10,316 candidates), finding that unrelated test failures comprise 20% of cases. Using 33 features from issue reports, comments, and commits, it trains per-project semi-supervised PU learning models that achieve precision 0.70-0.88, recall 0.30-1.00, F1 0.44-0.91, and AUC 0.63-0.97, with feature importance analysis identifying CI latency, repeated error messages, and preceding comments as useful predictors.
Significance. If the reported results hold, the work offers a practical way to reduce wasted developer time on non-actionable CI failures by using observable artifacts and PU learning to cope with limited labels. Strengths include concrete metrics on real Apache project data, identification of actionable features, and a semi-supervised treatment of a common CI pain point. The per-project evaluation and variance analysis provide a starting point for tool support, though broader impact depends on addressing generalizability.
major comments (3)
- [Evaluation] The evaluation of the PU learning models reports recall ranging from 0.30 to 1.00 across projects (with F1 as low as 0.44). This variance indicates that the fixed 33-feature set may fail to capture project-specific failure patterns, especially since models are trained and evaluated separately per project without any cross-project transfer or held-out validation experiments.
- [Methodology] The sampling of 371 confirmed unrelated build failures from 10,316 potentially unrelated ones, combined with the PU learning setup, requires explicit details on labeling criteria, inter-rater agreement, and validation of PU assumptions (e.g., the positive-unlabeled distribution). Without these, the representativeness of the labeled set and potential sampling bias cannot be assessed, directly affecting the reliability of the reported metrics.
- [Results] The central claim that the approach can help developers identify unrelated failures rests on per-project models; however, the absence of cross-project generalization tests means the headline performance numbers do not establish applicability to new projects or varying CI conditions, as noted by the wide metric ranges.
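The inter-rater agreement requested in the methodology comment is commonly reported as Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below computes it for two raters; the ten labels are invented for illustration and do not come from the paper.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum((counts_a[label] / n) * (counts_b[label] / n) for label in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two raters on ten sampled failures.
U, R = "unrelated", "related"
rater_a = [U, U, U, R, R, U, R, U, U, R]
rater_b = [U, U, R, R, R, U, R, U, U, U]

kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 3))  # 0.583: observed 0.80, chance 0.52
```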
minor comments (1)
- [Abstract] The abstract mentions the 33 features and 371 cases but provides limited transparency on the exact feature extraction process or sampling strategy; expanding this would aid reproducibility.
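On the sampling strategy: this page does not say how the figure of 371 was chosen. One observation, offered as an assumption rather than the authors' stated procedure, is that 371 matches the standard finite-population sample size for 10,316 items at a 95% confidence level and a 5% margin of error. The parameter values below (`z`, `margin`, `p`) are the conventional defaults, not values taken from the paper.

```python
import math

def sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Finite-population sample size for estimating a proportion."""
    variance_term = z ** 2 * p * (1 - p)
    numerator = population * variance_term
    denominator = margin ** 2 * (population - 1) + variance_term
    return math.ceil(numerator / denominator)

n = sample_size(10316)
print(n)  # 371
```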
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, clarifying our approach where appropriate and indicating planned revisions to improve clarity and completeness.
- Referee: [Evaluation] The evaluation of the PU learning models reports recall ranging from 0.30 to 1.00 across projects (with F1 as low as 0.44). This variance indicates that the fixed 33-feature set may fail to capture project-specific failure patterns, especially since models are trained and evaluated separately per project without any cross-project transfer or held-out validation experiments.
  Authors: We agree that the observed variance in recall and F1 scores reflects real differences in project-specific CI failure patterns, which motivated our per-project modeling strategy rather than a single global model. The 33 features were selected as commonly available signals across Apache projects to enable practical application. We will revise the evaluation section to include an explicit discussion of this variance as a key finding and its implications. Additionally, we will add a cross-project leave-one-out experiment to quantify transfer performance and report it in the revised manuscript.
  Revision: yes
- Referee: [Methodology] The sampling of 371 confirmed unrelated build failures from 10,316 potentially unrelated ones, combined with the PU learning setup, requires explicit details on labeling criteria, inter-rater agreement, and validation of PU assumptions (e.g., the positive-unlabeled distribution). Without these, the representativeness of the labeled set and potential sampling bias cannot be assessed, directly affecting the reliability of the reported metrics.
  Authors: We will expand the methodology and document analysis sections to provide explicit labeling criteria, describing how unrelatedness was determined from issue reports, comments, and commit context for the sampled failures. We will also report the author review process used for the 371 cases and any agreement measures obtained. For the PU learning setup, we will add a dedicated subsection validating the assumptions by discussing the selection of positives from the 10,316 candidates and the nature of the unlabeled set, following established PU learning practices. These additions will allow readers to better assess representativeness and bias.
  Revision: yes
- Referee: [Results] The central claim that the approach can help developers identify unrelated failures rests on per-project models; however, the absence of cross-project generalization tests means the headline performance numbers do not establish applicability to new projects or varying CI conditions, as noted by the wide metric ranges.
  Authors: We acknowledge that the wide metric ranges (particularly recall 0.30-1.00) indicate limited direct generalizability, and our claims are scoped to the seven studied projects where per-project models can be trained on historical data. The central contribution is demonstrating that PU learning on observable CI artifacts can reduce wasted effort within such projects. We will revise the results and threats-to-validity sections to more prominently state this scope as a limitation and to discuss the conditions under which the approach is expected to apply. No claim of universal applicability is made in the current manuscript.
  Revision: yes
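The cross-project leave-one-out experiment promised in the first response could be organized as below. This is a sketch of the splitting logic only, under the assumption that each failure record carries a project name; the project names and record fields are placeholders, since this page does not list the seven Apache projects or the data schema.

```python
def leave_one_project_out(records):
    """Yield (held_out_project, train, test), holding out one project at a time."""
    projects = sorted({record["project"] for record in records})
    for held_out in projects:
        train = [r for r in records if r["project"] != held_out]
        test = [r for r in records if r["project"] == held_out]
        yield held_out, train, test

# Hypothetical records; "confirmed_unrelated" stands in for the PU label s.
records = [
    {"project": "project_a", "confirmed_unrelated": 1},
    {"project": "project_a", "confirmed_unrelated": 0},
    {"project": "project_b", "confirmed_unrelated": 0},
    {"project": "project_c", "confirmed_unrelated": 1},
    {"project": "project_c", "confirmed_unrelated": 0},
]

splits = list(leave_one_project_out(records))
for held_out, train, test in splits:
    # A model would be fit on `train` and scored on `test` here.
    print(held_out, len(train), len(test))
```

Splitting by project rather than by random rows keeps every failure from the held-out project out of training, which is what a transfer claim requires.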
Circularity Check
No circularity: empirical PU models report performance on collected CI data without tautological reduction.
The paper collects 77,354 build failures, manually confirms 371 unrelated cases sampled from 10,316 potentially unrelated failures, extracts 33 observable features from issue reports, issue comments, and commits, and trains per-project PU classifiers whose precision, recall, F1, and AUC values are measured directly on that labeled data. No derivation step equates a claimed prediction to its own fitted inputs by construction, no uniqueness theorem or ansatz is smuggled in via self-citation, and the central results remain independent empirical measurements rather than definitional restatements. The observed metric variance reflects data characteristics, not circular logic.
Axiom & Free-Parameter Ledger
free parameters (1)
- PU learning model hyperparameters
axioms (1)
- Domain assumption: The sampled 371 unrelated failures and 10,316 potentially unrelated cases are representative of all CI build failures.